Contents: Applied Statistics Project

  1. Part-A: Solution
  2. Part-B: Solution
  3. Part-C: Solution

Part-A: Solution¶

In [1]:
# Import all the libraries needed to complete the analysis, visualization, modeling and presentation
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.patches as mpatches

1. Please refer to the table below to answer the following questions:¶

Q1.png

The Background:

  • Contingency tables classify outcomes in rows and columns. Table cells at the intersections of rows and columns indicate frequencies of both events coinciding.

  • Consequently, to calculate joint probabilities in a contingency table, take each cell count and divide by the grand total.

  • In contingency tables, you can locate the marginal probabilities in the row and column totals. Statisticians refer to them as marginal probabilities because you find them in the margins of contingency tables!

  • Fortunately, using contingency tables to calculate conditional probabilities is straightforward. It’s merely a matter of dividing a cell value by a row or column total.

Probability of A given B is written as P(A | B).

$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$

where P(A and B) = Joint probability of A and B

P(A) = Marginal probability of A

P(B) = Marginal probability of B
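As a minimal sketch of these rules, the table's known counts (400 planned and purchased, 500 who planned, 2000 respondents total) can be arranged as a 2×2 array; the 200/1300 split of the "did not plan" row is an assumption for illustration only:

```python
import numpy as np

# Contingency table: rows = planned to purchase (yes/no),
# columns = actually purchased (yes/no).
# 400 (planned & purchased) and the row total of 500 come from the table above;
# the 200/1300 split of the "did not plan" row is assumed for illustration.
table = np.array([[400,  100],     # planned to purchase
                  [200, 1300]])    # did not plan to purchase

grand_total = table.sum()                         # 2000 respondents
joint = table[0, 0] / grand_total                 # P(planned and purchased)
marginal_planned = table[0].sum() / grand_total   # P(planned) = 500/2000
conditional = joint / marginal_planned            # P(purchased | planned)

print(joint, marginal_planned, conditional)       # 0.2 0.25 0.8
```

Joint probabilities divide a cell by the grand total; marginal probabilities divide a row or column total by the grand total; conditional probabilities divide the joint by the marginal of the conditioning event.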

1A. Refer to the table above and find the joint probability of the people who planned to purchase and actually placed an order.¶

In [2]:
# The joint probability of the people who planned to purchase and actually placed an order
P1 = 400/2000
P1
Out[2]:
0.2

1B. Refer to the table above and find the conditional probability that people actually placed an order, given that they planned to purchase.¶

In [3]:
# The conditional probability that people actually placed an order, given that
# they planned to purchase: P(Order | Planned) = 400/500
P2 = 400/500
P2
Out[3]:
0.8

2. An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.¶

  • Binomial Distribution

    The binomial distribution is a widely used probability distribution for a discrete random variable.

    It plays a major role in quality control and quality assurance functions.

    $P(X = x\mid n,\pi)$ = $\frac{n!}{x!(n - x)!}\pi^x (1 - \pi)^{n-x} $

    • where P(X = x) is the probability of getting x successes in n trials

    and $\pi$ is the probability of an event of interest

    Some important functions in Python for Binomial distribution:

    1. Probability mass function

    scipy.stats.binom.pmf gives the probability mass function of the binomial distribution: binomial = scipy.stats.binom.pmf(k, n, p), where k is an array taking values in {0, 1, ..., n}, and n and p are the shape parameters of the distribution. The output, binomial, is an array of the corresponding probabilities.

    2. Cumulative distribution function

    cumbinomial = scipy.stats.binom.cdf(k, n, p) gives the cumulative binomial distribution. The output, cumbinomial, is an array of cumulative probabilities.

    3. Plot of the binomial distribution

    matplotlib.pyplot.plot(k, binomial, 'o-') plots the binomial probabilities against k.
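A quick sanity check of these functions (a sketch, using the same n = 10 and p = 0.05 as the problem below): the pmf over all possible k sums to 1, and the cdf is the running sum of the pmf.

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.05
k = np.arange(0, n + 1)

pmf = binom.pmf(k, n, p)   # P(X = k) for each k in 0..n
cdf = binom.cdf(k, n, p)   # P(X <= k) for each k in 0..n

assert np.isclose(pmf.sum(), 1.0)          # pmf over all outcomes sums to 1
assert np.allclose(cdf, np.cumsum(pmf))    # cdf is the running sum of the pmf
```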

In [4]:
# The failure rate for the manufactured item is 5%
p = 0.05
q = 1 - p
n = 10
k=np.arange(0,8)
k
Out[4]:
array([0, 1, 2, 3, 4, 5, 6, 7])

2A. Probability that none of the items are defective?¶

In [5]:
# Considering k = 0
# NoneDefective = P(0)
NoneDefective = stats.binom.pmf(0,n,p)
NoneDefective
Out[5]:
0.5987369392383787

2B. Probability that exactly one of the items is defective¶

In [6]:
# Considering k = 1
# OneDefectivee = P(1)
OneDefective = stats.binom.pmf(1,n,p)
OneDefective
Out[6]:
0.3151247048623047

2C. Probability that two or fewer of the items are defective?¶

In [7]:
# Considering k = 2
TwoDefective = stats.binom.pmf(2,n,p)
TwoDefective
Out[7]:
0.07463479852001963
In [8]:
# Two or fewer of the manufactured items are defective = P(0) + P(1) + P(2)
TwoOrFewerDefective = NoneDefective + OneDefective + TwoDefective
TwoOrFewerDefective
Out[8]:
0.988496442620703

2D. Probability that three or more of the items are defective¶

In [9]:
# P(X >= 3) = 1 - P(X <= 2), i.e. the complement of "two or fewer defective"

ThreeOrMoreDefective = 1 - TwoOrFewerDefective
ThreeOrMoreDefective
Out[9]:
0.011503557379297
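scipy also provides the survival function, binom.sf(k, n, p) = P(X > k), which returns P(X ≥ 3) in a single call and agrees with the complement of the cumulative probability P(X ≤ 2):

```python
from scipy.stats import binom

n, p = 10, 0.05

# Survival function: P(X > 2) = P(X >= 3)
three_or_more = binom.sf(2, n, p)
print(three_or_more)   # ~0.0115

# Agrees with the complement of the cumulative probability
assert abs(three_or_more - (1 - binom.cdf(2, n, p))) < 1e-12
```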

3. A car salesman sells on an average 3 cars per week.¶

  • Poisson Distribution

    This is a discrete distribution that also plays a major role in quality control.

    The Poisson distribution is a discrete probability distribution for the counts of events that occur randomly in a given interval of time or space. In such areas of opportunity, there can be more than one occurrence. In such situations, Poisson distribution can be used to compute probabilities.

    Examples include the number of defects per item or the number of defects per transformer produced.

    Note: the Poisson distribution helps to predict the arrival rate in a waiting-line situation where a queue forms and people wait to be served, and the service rate is generally higher than the arrival rate.

  • Properties:

    Mean μ = λ

    Standard deviation σ = √ μ

    The Poisson distribution is the limit of the binomial distribution as n approaches ∞ and p approaches 0, with np held fixed at λ

    P(X = x) = $\frac{e^{-\lambda} \lambda^x}{x!} $ where

    • P(x) = Probability of x successes, given $\lambda$
    • $\lambda$ = Average number of successes
    • e = 2.71828 (based on natural logarithm)
    • x = successes per unit which can take values 0,1,2,3,... $\infty$

    Applications

    Car Accidents

    Number of deaths by horse kicking in Prussian Army (first application)

    Birth defects and genetic mutation

    Note

    If there is a fixed number of observations, n, each of which is classified as an event of interest or not an event of interest, use the binomial distribution.

    If there is an area of opportunity, use the Poisson distribution.
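The limit statement above can be checked numerically: holding λ = np fixed while n grows, the binomial pmf converges to the Poisson pmf (a sketch with an arbitrary λ = 3):

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 3
k = np.arange(0, 15)

for n in (10, 100, 10_000):
    p = lam / n    # keep n*p = lambda fixed as n grows
    gap = np.max(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, lam)))
    print(f"n = {n:>6}: max |binomial - Poisson| = {gap:.5f}")
```

The maximum gap between the two pmfs shrinks as n increases, which is exactly the limit claimed above.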

In [10]:
# Average rate is denoted with lambda in the formula for poisson distribution 

rate = 3
In [11]:
# Use a numpy array to hold the numbers of successes, ranging from 0 to 19, to construct a probability distribution

n=np.arange(0,20)
In [12]:
# Calculate the distribution and store the probabilities in an array

PoissonPmf = stats.poisson.pmf(n,rate)

3A. What is the probability that in a given week he will sell some cars (i.e., at least one car)?¶

In [13]:
# P(at least one car) = 1 - P(0 cars), since the probabilities sum to 1

SellSomeCars = 1 - PoissonPmf[0]
SellSomeCars
Out[13]:
0.950212931632136

3B. What is the probability that in a given week he will sell 2 or more, but fewer than 5 cars?¶

In [14]:
SellTwoToFive = PoissonPmf[2]+PoissonPmf[3]+PoissonPmf[4]
SellTwoToFive
Out[14]:
0.6161149710523164
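The same answer can be cross-checked with the cdf, since P(2 ≤ X < 5) = P(X ≤ 4) - P(X ≤ 1):

```python
from scipy.stats import poisson

rate = 3
# P(2 <= X < 5) = P(X <= 4) - P(X <= 1)
two_to_four = poisson.cdf(4, rate) - poisson.cdf(1, rate)
print(two_to_four)   # ~0.6161, matching the pmf sum above
```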

3C. Plot the Poisson distribution function for the cumulative probability of cars sold per week vs. the number of cars sold per week. (Plotting both the CDF and PMF here.)¶

In [15]:
# Now we can create an array with Poisson cumulative probability values:

from scipy.stats import poisson

cdf = poisson.cdf(n, mu=3)
cdf = np.round(cdf, 3)

print(cdf)
[0.05  0.199 0.423 0.647 0.815 0.916 0.966 0.988 0.996 0.999 1.    1.
 1.    1.    1.    1.    1.    1.    1.    1.   ]
In [16]:
# If you want to print it in a nicer way with each 'n' value and the corresponding cumulative probability:

for val, prob in zip(n,cdf):
    print(f"n-value {val} has probability = {prob}")
n-value 0 has probability = 0.05
n-value 1 has probability = 0.199
n-value 2 has probability = 0.423
n-value 3 has probability = 0.647
n-value 4 has probability = 0.815
n-value 5 has probability = 0.916
n-value 6 has probability = 0.966
n-value 7 has probability = 0.988
n-value 8 has probability = 0.996
n-value 9 has probability = 0.999
n-value 10 has probability = 1.0
n-value 11 has probability = 1.0
n-value 12 has probability = 1.0
n-value 13 has probability = 1.0
n-value 14 has probability = 1.0
n-value 15 has probability = 1.0
n-value 16 has probability = 1.0
n-value 17 has probability = 1.0
n-value 18 has probability = 1.0
n-value 19 has probability = 1.0
In [17]:
# Using matplotlib library, we can easily plot the Poisson CDF using Python:

plt.plot(n, cdf, marker='o')
plt.title('Poisson: $\lambda$ = %i ' % rate)
plt.xlabel('Number of Cars Sold per Week')
plt.ylabel('Cumulative Probability of Number of Cars Sold per Week')

plt.show()
[Figure: Poisson CDF plot of cumulative probability vs. number of cars sold per week]
In [18]:
# Printing the probability distribution for different values of x

PoissonPmf
Out[18]:
array([4.97870684e-02, 1.49361205e-01, 2.24041808e-01, 2.24041808e-01,
       1.68031356e-01, 1.00818813e-01, 5.04094067e-02, 2.16040315e-02,
       8.10151179e-03, 2.70050393e-03, 8.10151179e-04, 2.20950322e-04,
       5.52375804e-05, 1.27471339e-05, 2.73152870e-06, 5.46305740e-07,
       1.02432326e-07, 1.80762929e-08, 3.01271548e-09, 4.75691918e-10])
In [19]:
# Using matplotlib library, we can easily plot the Poisson PMF using Python:

plt.plot(n,PoissonPmf,'o-')
plt.title('Poisson: $\lambda$ = %i ' % rate)
plt.xlabel('Number of Cars Sold per Week')
plt.ylabel('Probability of Number of Cars Sold per Week')
plt.show()
[Figure: Poisson PMF plot of probability vs. number of cars sold per week]

4. Accuracy in understanding orders for a speech-based bot at a restaurant is important for Company X, which has designed, marketed, and launched the product for contactless delivery due to the COVID-19 pandemic. Recognition accuracy, which measures the percentage of orders that are taken correctly, is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.¶

Because there are three orders and the probability of a correct order is 0.868, we can use the binomial distribution equation:

    P(X = 3 | n = 3, π = 0.868) = [3! / (3!(3-3)!)] × (0.868)^3 × (1 - 0.868)^(3-3) = 0.6540

    Likewise, for X = 0 and X = 2:
     P(X = 0) = 0.0023
     P(X = 2) = 0.2984

    Hence, P(X >= 2) = P(X = 2) + P(X = 3) ≈ 0.9523

  • The probability that all three orders are filled correctly is 0.6540, i.e. 65.40%
  • The probability that none of the orders is filled correctly is 0.0023, i.e. 0.23%
  • The probability that at least two of the three orders are filled correctly is 0.9523, i.e. 95.23%
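These hand calculations can be verified with scipy (a sketch using the same n = 3 and π = 0.868; small differences from the rounded values above are due to rounding):

```python
from scipy.stats import binom

n, p = 3, 0.868

all_three    = binom.pmf(3, n, p)   # P(X = 3), ~0.6540
none_correct = binom.pmf(0, n, p)   # P(X = 0), ~0.0023
at_least_two = binom.sf(1, n, p)    # P(X >= 2) = P(2) + P(3), ~0.9523

print(all_three, none_correct, at_least_two)
```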
In [20]:
# Since there are only 2 outcomes (order filled correctly or not filled correctly)
# and the number of trials is 3, we can use the Binomial distribution with the following parameters:

p1 = 0.868
q1 = 1 - p1
n1 = 3

4A. What is the probability that all three orders will be recognised correctly?¶

In [21]:
ThreeOrdersRecognisedCorrect = stats.binom.pmf(3,3,p1)
ThreeOrdersRecognisedCorrect
Out[21]:
0.653972032

4B. What is the probability that none of the three orders will be recognised correctly?¶

In [22]:
NoneThreeOrdersRecognisedCorrect = stats.binom.pmf(0,3,p1)
NoneThreeOrdersRecognisedCorrect
Out[22]:
0.0022999680000000003

4C. What is the probability that at least two of the three orders will be recognised correctly?¶

In [23]:
# We need P(X >= 2) = 1 - P(X <= 1) = 1 - P(0) - P(1)

TwoOfThreeRecognisedCorrect = 1 - stats.binom.cdf(1,3,p1)
TwoOfThreeRecognisedCorrect
Out[23]:
0.952327936

5. Explain 1 real life industry scenario (other than the ones mentioned above) where you can use the concepts learnt in this module of Applied Statistics to get data driven business solution.¶

We can use statistics to evaluate potential new versions of a children’s dry cereal. Taste tests provide valuable statistical information on what customers want from a product. The four key factors that product developers may vary to enhance the taste of the cereal are the following:

  1. Ratio of wheat to corn in the cereal flake
  2. Type of sweetener: sugar, honey, artificial or sugar free
  3. Presence or absence of flavour in the cereal - Fruits, Vegetables, Spices
  4. Cooking time - Short or Long

We should design an experiment to determine what effects these four factors had on cereal taste. For example, one test cereal can be made with a specified ratio of wheat to corn, sugar as the sweetener, flavour bits, and a short cooking time; another test cereal can be made with a different ratio of wheat to corn and the other three factors the same, and so on. Groups of children then taste-test the cereals and state what they think about the taste of each.

Analysis of variance (ANOVA) is the statistical method we can use to study the data obtained from the taste tests. The results of the analysis may show the following:

  • The flake composition and sweetener type were highly influential in taste evaluation.
  • The flavour bits actually reduced the taste of the cereal.
  • The cooking time had no effect on the taste.

This information can be vital to identify the factors that would lead to the best-tasting cereal. The same information can be used by the marketing and manufacturing teams, and for a better product development strategy.
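As an illustrative sketch of such an analysis, a one-way ANOVA can be run with scipy's f_oneway; all taste scores below are synthetic, invented purely to show the mechanics:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)

# Hypothetical taste scores (scale 1-10) for three sweetener groups;
# the group means and sample sizes are assumptions for illustration only.
sugar      = rng.normal(7.5, 1.0, size=30)
honey      = rng.normal(7.2, 1.0, size=30)
artificial = rng.normal(6.0, 1.0, size=30)

f_stat, p_value = f_oneway(sugar, honey, artificial)
print(f"F = {f_stat:.2f}, p = {p_value:.2g}")

if p_value < 0.05:
    print("At least one sweetener's mean taste score differs significantly")
```

A small p-value tells us a factor matters; follow-up (post-hoc) comparisons would then identify which levels of the factor differ.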

Tools to be used: Python, R, Minitab, Excel, MS SQL Server

References:

  • ANOVA
  • BURKE MARKETING SERVICES

Part-B: Solution¶

  • DOMAIN: Sports
  • CONTEXT: Company X manages the men's top professional basketball division of the American league system.

The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.

  • DATA DESCRIPTION: Basketball.csv - The data set contains information on all the teams that have participated so far in all the past tournaments.

  • DATA DICTIONARY:
  1. Team: Team’s name
  2. Tournament: Number of played tournaments.
  3. Score: Team’s score so far.
  4. PlayedGames: Games played by the team so far.
  5. WonGames: Games won by the team so far.
  6. DrawnGames: Games drawn by the team so far.
  7. LostGames: Games lost by the team so far.
  8. BasketScored: Basket scored by the team so far.
  9. BasketGiven: Basket scored against the team so far.
  10. TournamentChampion: How many times the team was a champion of the tournaments so far.
  11. Runner-up: How many times the team was a runners-up of the tournaments so far.
  12. TeamLaunch: Year the team was launched on professional basketball.
  13. HighestPositionHeld: Highest position held by the team amongst all the tournaments played.
  • PROJECT OBJECTIVE: The company’s management wants to invest in a proposal for managing some of the best teams in the league. The analytics department has been assigned the task of creating a report on the performance shown by the teams. Some of the older teams are already under contract with competitors, so Company X wants to understand which teams it can approach for a winning deal.

Step-1: Read, Clean, and Prepare Dataset to be used for EDA¶

Import the Relevant Libraries¶

In [24]:
# Import all the libraries needed to complete the analysis, visualization, modeling and presentation
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.patches as mpatches

# Configure for any default setting of any library
%matplotlib inline
sns.set(style='darkgrid', palette='deep', font='sans-serif', font_scale=1.3, color_codes=True)

Some Comments about the Libraries:

  • %matplotlib inline sets the backend of matplotlib to the 'inline' backend: with this backend, the output of plotting commands is displayed inline, without needing to call plt.show() every time data is plotted.
  • sns.set configures a few of Seaborn's aesthetic parameters

Read the Dataset¶

In [25]:
basketball=pd.read_csv("Basketball.csv");

Shape of the Dataset¶

In [26]:
# There are 61 observations/rows and 13 attributes/columns.

basketball.shape
Out[26]:
(61, 13)

Check Information about the Data/Data types of all Attributes¶

In [27]:
basketball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64 
 2   Score                61 non-null     object
 3   PlayedGames          61 non-null     object
 4   WonGames             61 non-null     object
 5   DrawnGames           61 non-null     object
 6   LostGames            61 non-null     object
 7   BasketScored         61 non-null     object
 8   BasketGiven          61 non-null     object
 9   TournamentChampion   61 non-null     object
 10  Runner-up            61 non-null     object
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 6.3+ KB

Data Cleaning¶

In [28]:
# Check the head of the dataset

basketball.head()
Out[28]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion Runner-up TeamLaunch HighestPositionHeld
0 Team 1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1
1 Team 2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1
2 Team 3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1
3 Team 4 82 3386 2664 1187 616 861 4398 3469 6 6 1931to32 1
4 Team 5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1
In [29]:
# Check the tail of the dataset

basketball.tail()
Out[29]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion Runner-up TeamLaunch HighestPositionHeld
56 Team 57 1 34 38 8 10 20 38 66 - - 2009-10 20
57 Team 58 1 22 30 7 8 15 37 57 - - 1956-57 16
58 Team 59 1 19 30 7 5 18 51 85 - - 1951~52 16
59 Team 60 1 14 30 5 4 21 34 65 - - 1955-56 15
60 Team 61 1 - - - - - - - - - 2017~18 9
In [30]:
# Check for null values
# No Null values in data

basketball.isna().sum()
Out[30]:
Team                   0
Tournament             0
Score                  0
PlayedGames            0
WonGames               0
DrawnGames             0
LostGames              0
BasketScored           0
BasketGiven            0
TournamentChampion     0
Runner-up              0
TeamLaunch             0
HighestPositionHeld    0
dtype: int64
In [31]:
# Check for duplicates in data
# We do not have duplicates

dupes = basketball.duplicated()
sum(dupes)
Out[31]:
0
In [32]:
# Generate descriptive statistics.

# Descriptive statistics include those that summarize the central
# tendency, dispersion and shape of a
# dataset's distribution, excluding ``NaN`` values.

# Analyzes both numeric and object series, as well
# as ``DataFrame`` column sets of mixed data types. The output
# will vary depending on what is provided.

basketball.describe(include="all")
Out[32]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion Runner-up TeamLaunch HighestPositionHeld
count 61 61.000000 61 61 61 61 61 61 61 61 61 61 61.000000
unique 61 NaN 61 53 59 57 56 60 61 8 10 47 NaN
top Team 1 NaN 4385 2762 7 14 37 70 3140 - - 1929 NaN
freq 1 NaN 1 3 2 2 3 2 1 52 48 10 NaN
mean NaN 24.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.081967
std NaN 26.827225 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5.276663
min NaN 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000000
25% NaN 4.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.000000
50% NaN 12.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 6.000000
75% NaN 38.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10.000000
max NaN 86.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20.000000

Some Insights:

  • describe() shows NaN wherever a statistic does not apply to a column, and the head/tail views show '-' placeholders, so there appear to be some missing values in the dataset.

  • The input formats differ in the TeamLaunch, TournamentChampion, and Runner-up columns, so let's inspect the data in these 3 columns.

  • Also, let's make the column names consistent: Runner-up → RunnerUp

In [33]:
# Rename the columns for consistency

basketball.rename(columns = {'Runner-up': 'RunnerUp'}, inplace = True)
In [34]:
# Check for unique values in various columns

print("TournamentChampion",basketball['TournamentChampion'].unique())
print("RunnerUp",basketball['RunnerUp'].unique())
print("TeamLaunch",basketball['TeamLaunch'].unique())
print("Team",basketball['Team'].unique())
print("Score",basketball['Score'].unique())
TournamentChampion ['33' '25' '10' '6' '8' '1' '-' '2']
RunnerUp ['23' '25' '8' '6' '7' '4' '-' '3' '1' '5']
TeamLaunch ['1929' '1931to32' '1934-35' '1939-40' '1932-33' '1941to42' '1948-49'
 '1944_45' '1935-36' '1949_50' '1933to34' '1960-61' '1951-52' '1998-99'
 '1941-42' '1977-78' '1959-60' '2004to05' '1961-62' '1940-41' '1930-31'
 '1963-64' '1974-75' '1943-44' '1987-88' '1991_92' '2007-08' '1962-63'
 '1994-95' '1978-79' '1971-72' '1999to00' '2014-15' '1990-91' '1947-48'
 '1996-97' '1995-96' '1945-46' '1953-54' '1979-80' '1950-51' '2016_17'
 '2009-10' '1956-57' '1951~52' '1955-56' '2017~18']
Team ['Team 1' 'Team 2' 'Team 3' 'Team 4' 'Team 5' 'Team 6' 'Team 7' 'Team 8'
 'Team 9' 'Team 10' 'Team 11' 'Team 12' 'Team 13' 'Team 14' 'Team 15'
 'Team 16' 'Team 17' 'Team 18' 'Team 19' 'Team 20' 'Team 21' 'Team 22'
 'Team 23' 'Team 24' 'Team 25' 'Team 26' 'Team 27' 'Team 28' 'Team 29'
 'Team 30' 'Team 31' 'Team 32' 'Team 33' 'Team 34' 'Team 35' 'Team 36'
 'Team 37' 'Team 38' 'Team 39' 'Team 40' 'Team 41' 'Team 42' 'Team 43'
 'Team 44' 'Team 45' 'Team 46' 'Team 47' 'Team 48' 'Team 49' 'Team 50'
 'Team 51' 'Team 52' 'Team 53' 'Team 54' 'Team 55' 'Team 56' 'Team 57'
 'Team 58' 'Team 59' 'Team 60' 'Team 61']
Score ['4385' '4262' '3442' '3386' '3368' '2819' '2792' '2573' '2109' '1884'
 '1814' '1789' '1471' '1416' '1389' '1351' '1314' '1174' '1148' '1020'
 '970' '667' '662' '606' '553' '538' '510' '445' '421' '416' '375' '353'
 '343' '293' '285' '277' '242' '230' '190' '188' '168' '150' '148' '132'
 '107' '96' '91' '83' '81' '76' '71' '56' '52' '42' '40' '35' '34' '22'
 '19' '14' '-']

Some Observations:

  • Every row in the dataset represents a team, so we have 61 teams to analyze.
  • For easy data handling and visualization, we can rename 'Team ' to 'T' in the Team column.
  • The character '-' is used as a missing-value marker, and TeamLaunch mixes separators ('-', '_', '~', 'to'); we should eliminate or replace these.
  • For the TeamLaunch year, we can keep just the initial year for easy data cleaning and analysis.
  • Most columns have dtype object even though the values are integers, so a type conversion would be beneficial.
  • Team 61 seems to have no valid data: it participated in 1 tournament and got 9th position, with no information about games or baskets. Let's drop it.
In [35]:
# Make a copy of the original dataset
basketballOriginal=basketball.copy(deep=True); 

# For TeamLaunch year, we can consider just the initial year for easy data cleaning and analysis.
basketball['TeamLaunch'] = basketballOriginal['TeamLaunch'].str.slice(0,4);

# We can fill Zero in place of missing value '-'. 
# Because if HighestPositionHeld is not equal to 1 or 2 then TournamentChampion, Runner-up fields are zero too.
basketball=basketball.replace('-',0);

# For easy data handling and visualization, we can rename the Team as T in the 'Team' Column.
basketball['Team']=basketball['Team'].str.replace('Team ','T');
columns = basketball.columns.drop('Team');

# Data types of most of the columns are Objects; 
# As values are in integers, it would be beneficial to do the type conversion here.
basketball[columns] = basketball[columns].apply(pd.to_numeric, errors='coerce')

# Team 61 seems to have no valid data: It participated in 1 tournament and got 9th position. 
# No information about games or baskets. We can safely drop it.
basketball.drop(60,inplace=True)

Final Dataset after Data Cleaning¶

In [36]:
# Check the head of the dataset after cleaning

basketball.head(5)
Out[36]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld
0 T1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1
1 T2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1
2 T3 80 3442 2614 1241 598 775 4534 3309 10 8 1929 1
3 T4 82 3386 2664 1187 616 861 4398 3469 6 6 1931 1
4 T5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1
In [37]:
# Check the tail of the dataset after cleaning

basketball.tail(5)
Out[37]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld
55 T56 1 35 38 8 11 19 36 55 0 0 2016 17
56 T57 1 34 38 8 10 20 38 66 0 0 2009 20
57 T58 1 22 30 7 8 15 37 57 0 0 1956 16
58 T59 1 19 30 7 5 18 51 85 0 0 1951 16
59 T60 1 14 30 5 4 21 34 65 0 0 1955 15
In [38]:
basketball.describe(include="all")
Out[38]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld
count 60 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.00000 60.000000 60.000000 60.000000
unique 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top T1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN 24.383333 916.450000 810.100000 309.033333 192.083333 308.816667 1159.350000 1159.233333 1.45000 1.433333 1957.950000 7.050000
std NaN 26.884620 1138.342899 877.465393 408.481395 201.985508 294.508639 1512.063948 1163.946914 5.51554 4.574679 26.646732 5.315232
min NaN 1.000000 14.000000 30.000000 5.000000 4.000000 15.000000 34.000000 55.000000 0.00000 0.000000 1929.000000 1.000000
25% NaN 4.000000 104.250000 115.500000 34.750000 26.250000 62.750000 154.500000 236.000000 0.00000 0.000000 1934.750000 3.000000
50% NaN 12.000000 395.500000 424.500000 124.000000 98.500000 197.500000 444.000000 632.500000 0.00000 0.000000 1950.500000 6.000000
75% NaN 39.000000 1360.500000 1345.500000 432.750000 331.500000 563.500000 1669.750000 2001.250000 0.00000 0.000000 1977.250000 10.000000
max NaN 86.000000 4385.000000 2762.000000 1647.000000 633.000000 1070.000000 5947.000000 3889.000000 33.00000 25.000000 2016.000000 20.000000

Step-2: Univariate Analysis¶

Univariate analysis refers to the analysis of a single variable. Its main purpose is to summarize the data and find patterns; the key point is that only one variable is involved in the analysis.

Basic Statistics¶

In [39]:
# Print the mean of each attribute. Ignore "Team" as it is not a continuous variable

print("basketball:",basketball.mean())
basketball: Tournament               24.383333
Score                   916.450000
PlayedGames             810.100000
WonGames                309.033333
DrawnGames              192.083333
LostGames               308.816667
BasketScored           1159.350000
BasketGiven            1159.233333
TournamentChampion        1.450000
RunnerUp                  1.433333
TeamLaunch             1957.950000
HighestPositionHeld       7.050000
dtype: float64
In [40]:
# Print the median values of the basketball.
# Observe that the values of mean and median are not the same for most of the attributes.

print("basketball:",basketball.median())
basketball: Tournament               12.0
Score                   395.5
PlayedGames             424.5
WonGames                124.0
DrawnGames               98.5
LostGames               197.5
BasketScored            444.0
BasketGiven             632.5
TournamentChampion        0.0
RunnerUp                  0.0
TeamLaunch             1950.5
HighestPositionHeld       6.0
dtype: float64
In [41]:
# Print the mode of the attribute. The attribute is unimodal

print(basketball['HighestPositionHeld'].mode())
0    1
dtype: int64
In [42]:
# Prints the value below which 25% of the data lies

print("Data_quantile(25%):",basketball.quantile(q=0.25))
Data_quantile(25%): Tournament                4.00
Score                   104.25
PlayedGames             115.50
WonGames                 34.75
DrawnGames               26.25
LostGames                62.75
BasketScored            154.50
BasketGiven             236.00
TournamentChampion        0.00
RunnerUp                  0.00
TeamLaunch             1934.75
HighestPositionHeld       3.00
Name: 0.25, dtype: float64
In [43]:
# Prints the value below which 50% of the data lies

print("Data_quantile(50%):",basketball.quantile(q=0.50))
Data_quantile(50%): Tournament               12.0
Score                   395.5
PlayedGames             424.5
WonGames                124.0
DrawnGames               98.5
LostGames               197.5
BasketScored            444.0
BasketGiven             632.5
TournamentChampion        0.0
RunnerUp                  0.0
TeamLaunch             1950.5
HighestPositionHeld       6.0
Name: 0.5, dtype: float64
In [44]:
# Prints the value below which 75% of the data lies

print("Data_quantile(75%):",basketball.quantile(q=0.75))
Data_quantile(75%): Tournament               39.00
Score                  1360.50
PlayedGames            1345.50
WonGames                432.75
DrawnGames              331.50
LostGames               563.50
BasketScored           1669.75
BasketGiven            2001.25
TournamentChampion        0.00
RunnerUp                  0.00
TeamLaunch             1977.25
HighestPositionHeld      10.00
Name: 0.75, dtype: float64
In [45]:
# The below output represents the IQR values for all the attributes

basketball.quantile(0.75) - basketball.quantile(0.25)
Out[45]:
Tournament               35.00
Score                  1256.25
PlayedGames            1230.00
WonGames                398.00
DrawnGames              305.25
LostGames               500.75
BasketScored           1515.25
BasketGiven            1765.25
TournamentChampion        0.00
RunnerUp                  0.00
TeamLaunch               42.50
HighestPositionHeld       7.00
dtype: float64
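The IQR computed above is commonly used to flag outliers via Tukey's fences, Q1 - 1.5·IQR and Q3 + 1.5·IQR (the same rule box plots use); a sketch on a small stand-in series:

```python
import pandas as pd

# A small stand-in series; on the real data this would be e.g. basketball['Score']
s = pd.Series([14, 104, 396, 916, 1360, 4385])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey's fences

outliers = s[(s < lower) | (s > upper)]
print(f"IQR = {iqr}, fences = ({lower}, {upper})")
print("Outliers:", outliers.tolist())
```

Values outside the fences are the points a box plot draws as individual markers beyond the whiskers.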
In [46]:
# Range: The difference between the highest value and lowest values for all individual attributes
# columns = basketball.columns.drop('Team')

print(basketball[columns].max() - basketball[columns].min())
Tournament               85
Score                  4371
PlayedGames            2732
WonGames               1642
DrawnGames              629
LostGames              1055
BasketScored           5913
BasketGiven            3834
TournamentChampion       33
RunnerUp                 25
TeamLaunch               87
HighestPositionHeld      19
dtype: int64
In [47]:
# The variance below indicates how dispersed the data is

print(basketball.var())
Tournament             7.227828e+02
Score                  1.295825e+06
PlayedGames            7.699455e+05
WonGames               1.668570e+05
DrawnGames             4.079815e+04
LostGames              8.673534e+04
BasketScored           2.286337e+06
BasketGiven            1.354772e+06
TournamentChampion     3.042119e+01
RunnerUp               2.092768e+01
TeamLaunch             7.100483e+02
HighestPositionHeld    2.825169e+01
dtype: float64
In [48]:
# The standard deviation below indicates how much the data deviates from the mean

print(basketball.std())
Tournament               26.884620
Score                  1138.342899
PlayedGames             877.465393
WonGames                408.481395
DrawnGames              201.985508
LostGames               294.508639
BasketScored           1512.063948
BasketGiven            1163.946914
TournamentChampion        5.515540
RunnerUp                  4.574679
TeamLaunch               26.646732
HighestPositionHeld       5.315232
dtype: float64
In [49]:
# Understand the Skewness of the data
# Positively skewed: Most frequent values are low and tail is towards the high values.

basketball.skew()
Out[49]:
Tournament             1.197176
Score                  1.574104
PlayedGames            1.123454
WonGames               1.786067
DrawnGames             0.984899
LostGames              0.880596
BasketScored           1.758058
BasketGiven            0.958164
TournamentChampion     4.734845
RunnerUp               4.321794
TeamLaunch             0.682570
HighestPositionHeld    0.832164
dtype: float64
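As a sanity check on the sign convention used above, a right-skewed toy sample (many low values with a long tail toward high values, hypothetical data) should yield a positive `skew()`:

```python
import pandas as pd

# Right-skewed toy sample (hypothetical): many small values, a long tail of large ones
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20, 50])

# A positive value confirms the right (positive) skew
print(right_skewed.skew())
```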
In [50]:
# Understand the Kurtosis of the data

basketball.kurtosis()
Out[50]:
Tournament              0.176135
Score                   1.686503
PlayedGames            -0.030398
WonGames                2.577189
DrawnGames             -0.422334
LostGames              -0.456038
BasketScored            2.406123
BasketGiven            -0.451057
TournamentChampion     23.326192
RunnerUp               19.482117
TeamLaunch             -0.787472
HighestPositionHeld    -0.284934
dtype: float64
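Note that pandas `kurtosis()` reports *excess* kurtosis, so a normal distribution scores close to 0; values like 23.3 for TournamentChampion therefore indicate very heavy tails. A quick sketch on a seeded synthetic sample (not the basketball data):

```python
import numpy as np
import pandas as pd

# Large sample from a normal distribution (seeded for reproducibility)
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=100_000))

# pandas reports *excess* kurtosis: approximately 0 for a normal distribution
print(normal_sample.kurtosis())
```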

Histogram for checking the Distribution, Skewness¶

In [51]:
# Check for distribution, skewness

# This function combines the matplotlib hist function (with automatic calculation of a 
# good default bin size) with the seaborn kdeplot() and rugplot() functions. It can 
# also fit scipy.stats distributions and plot the estimated PDF over the data.

select = ['Tournament','Score','PlayedGames','WonGames','DrawnGames','LostGames',
          'BasketScored','BasketGiven','TournamentChampion','RunnerUp','TeamLaunch',
          'HighestPositionHeld']
plt.figure(figsize=(20,20))
index = 1
for col in basketball[select]:
    plt.subplot(4,3,index)
    sns.distplot(basketball[col], rug=True, kde=True,
                 rug_kws={"color": "r"},
                 kde_kws={"color": "k"},
                 hist_kws={"color": "c"})
    index += 1
[Figure: histograms with KDE and rug plots for all numerical variables]

Box Plot to understand the Distribution¶

In [52]:
# Draw a box plot to show distributions with respect to categories.

# A box plot (or box-and-whisker plot) shows the distribution of quantitative
# data in a way that facilitates comparisons between variables or across
# levels of a categorical variable. The box shows the quartiles of the
# dataset while the whiskers extend to show the rest of the distribution,
# except for points that are determined to be "outliers" using a method
# that is a function of the inter-quartile range.

fig, ax = plt.subplots(3, 4)

fig.set_figheight(5)
fig.set_figwidth(15)

sns.boxplot(x=basketball['PlayedGames'],ax=ax[0][0]);
sns.boxplot(x=basketball['WonGames'],ax=ax[0][1]);
sns.boxplot(x=basketball['TournamentChampion'],ax=ax[0][2]);
sns.boxplot(x=basketball['RunnerUp'],ax=ax[0][3]);

sns.boxplot(x=basketball['TeamLaunch'],ax=ax[1][0]);
sns.boxplot(x=basketball['HighestPositionHeld'],ax=ax[1][1]);
sns.boxplot(x=basketball['BasketScored'],ax=ax[1][2]);
sns.boxplot(x=basketball['BasketGiven'],ax=ax[1][3]);

sns.boxplot(x=basketball['Tournament'],ax=ax[2][0]);
sns.boxplot(x=basketball['Score'],ax=ax[2][1]);
sns.boxplot(x=basketball['DrawnGames'],ax=ax[2][2]);
sns.boxplot(x=basketball['LostGames'],ax=ax[2][3]);

fig.tight_layout() 
plt.show()
[Figure: box plots for all numerical variables]

Understand the complete Dataset Distribution¶

In [53]:
# Consider the distibution of all the quantitative variables

plt.figure(figsize=(10,5))

columns = basketball.columns.drop('Team');
basketballColumns = basketball[columns];
sns.distplot(basketballColumns);
plt.show()
[Figure: combined distribution of all quantitative variables]
In [54]:
# Cumulative distribution for all the quantitative variables

plt.figure(figsize=(10,5))
sns.distplot(basketballColumns, hist_kws=dict(cumulative=True), kde_kws=dict(cumulative=True));
plt.show()
[Figure: cumulative distribution of all quantitative variables]

Important Insights:

  • Positively skewed data: the most frequent values are low and the tail extends towards the high values.
  • Score, BasketScored, and WonGames have similar box plots, each with some outliers.
  • TournamentChampion and RunnerUp have zero-width boxes because 75% of their values are zero; the remaining points appear as outliers.
  • Most of the values are scattered between the median and the upper quartile, which again shows that the data is positively skewed.
  • No outliers appear in the PlayedGames, TeamLaunch, HighestPositionHeld, BasketGiven, Tournament, DrawnGames, and LostGames variables.
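The whisker rule behind these box plots (flag points beyond 1.5 × IQR from the box edges) can be applied directly in code. A minimal sketch on hypothetical scores with one obvious extreme value:

```python
import pandas as pd

# Hypothetical scores (not the basketball data) with one obvious extreme value
s = pd.Series([14, 101, 120, 150, 180, 210, 260, 4385])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whisker fences are flagged as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # → [4385]
```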

Explore the 'Score' Variable¶

In [55]:
# The following code plots a histogram using the matplotlib package.
# The bins argument creates class intervals. In this case we are creating 50 such intervals.

# In the output below, the first array is the frequency in each class and the second array 
# contains the edges of the class intervals. These arrays can be assigned to a variable and 
# used for further analysis.

plt.hist(basketball['Score'], bins=50)
Out[55]:
(array([15.,  6.,  3.,  5.,  4.,  2.,  2.,  2.,  0.,  0.,  1.,  1.,  1.,
         1.,  1.,  2.,  2.,  0.,  0.,  0.,  2.,  1.,  0.,  1.,  0.,  0.,
         0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  2.,
         1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.]),
 array([  14.  ,  101.42,  188.84,  276.26,  363.68,  451.1 ,  538.52,
         625.94,  713.36,  800.78,  888.2 ,  975.62, 1063.04, 1150.46,
        1237.88, 1325.3 , 1412.72, 1500.14, 1587.56, 1674.98, 1762.4 ,
        1849.82, 1937.24, 2024.66, 2112.08, 2199.5 , 2286.92, 2374.34,
        2461.76, 2549.18, 2636.6 , 2724.02, 2811.44, 2898.86, 2986.28,
        3073.7 , 3161.12, 3248.54, 3335.96, 3423.38, 3510.8 , 3598.22,
        3685.64, 3773.06, 3860.48, 3947.9 , 4035.32, 4122.74, 4210.16,
        4297.58, 4385.  ]),
 <BarContainer object of 50 artists>)
[Figure: histogram of Score with 50 bins]
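The frequency/edges pair in the output above is the same structure returned by `numpy.histogram`, which `plt.hist` uses under the hood. A standalone toy example (hypothetical data):

```python
import numpy as np

# Toy data (hypothetical) binned into 3 class intervals
data = np.array([1, 2, 2, 3, 7, 8, 9, 9, 9, 10])

# counts[i] is the frequency in the half-open interval [edges[i], edges[i+1]);
# the last bin is closed on the right
counts, edges = np.histogram(data, bins=3)
print(counts, edges)  # → [4 0 6] [ 1.  4.  7. 10.]
```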
In [56]:
# Lets plot a frequency polygon superimposed on a histogram using the seaborn package.
# Seaborn automatically creates class intervals. The number of bins can also be manually set.

sns.distplot(basketball['Score']) 
Out[56]:
<AxesSubplot:xlabel='Score', ylabel='Density'>
[Figure: histogram with KDE overlay for Score]
In [57]:
# Lets add an argument to plot only the frequency polygon

sns.distplot(basketball['Score'], hist=False)
Out[57]:
<AxesSubplot:xlabel='Score', ylabel='Density'>
[Figure: KDE (frequency polygon) of Score]
In [58]:
# Lets plot a violin plot using the seaborn package.

# This distribution can also be visualised in another manner. 
# For this we can use the violin plot function from seaborn. 
# The violin plot shows a vertical mirror image of the distribution 
# along with the original distribution.

sns.violinplot(basketball['Score']) 
Out[58]:
<AxesSubplot:xlabel='Score'>
[Figure: violin plot of Score]
In [59]:
# Now let us have a closer look at the distribution by plotting a simple histogram with 10 bins.

plt.figure(figsize=(20,10)) # makes the plot wider
plt.hist(basketball['Score'], color='g') # plots a simple histogram
plt.axvline(basketball['Score'].mean(), color='r', linewidth=2, label='Mean')
plt.axvline(basketball['Score'].median(), color='b', linestyle='dashed', linewidth=2, label='Median')
plt.axvline(basketball['Score'].mode()[0], color='k', linestyle='dashed', linewidth=2, label='Mode')
plt.legend()
Out[59]:
<matplotlib.legend.Legend at 0x21f39695670>
[Figure: histogram of Score with mean, median, and mode lines]
In [60]:
# Create boxplot for column="Score"

# Make a box plot from DataFrame columns.

# Make a box-and-whisker plot from DataFrame columns, optionally grouped
# by some other columns. A box plot is a method for graphically depicting
# groups of numerical data through their quartiles.
# The box extends from the Q1 to Q3 quartile values of the data,
# with a line at the median (Q2). The whiskers extend from the edges
# of box to show the range of the data. By default, they extend no more than
# `1.5 * IQR (IQR = Q3 - Q1)` from the edges of the box, ending at the farthest
# data point within that interval. Outliers are plotted as separate dots.

basketball.boxplot(column="Score",return_type='axes',figsize=(8,8))

# sns.boxplot(x=basketball['Score'])
Out[60]:
<AxesSubplot:>
[Figure: box plot of Score]
In [61]:
# Histogram for checking the Skewness

plt.figure(figsize=(10,5))

#convert pandas DataFrame object to numpy array and sort
h = np.asarray(basketball['Score'])
h = sorted(h)
 
#use the scipy stats module to fit a normal distribution with the same mean and standard deviation
fit = stats.norm.pdf(h, np.mean(h), np.std(h)) 
 
#plot both series on the histogram
plt.plot(h,fit,'-',linewidth = 2,label="Normal distribution with same mean and var")
plt.hist(h,density=True,bins = 100,label="Actual distribution")      
plt.legend()
plt.show()
[Figure: Score distribution vs. fitted normal distribution]

Important Insights:

  • We can see from the above chart that Mean > Median > Mode. This implies a positive (right-skewed) distribution.
  • In the figure, the mean is represented by the red line, the median by the blue dashed line, and the mode by the black dashed line.
  • Most of the observations fall within the first of the 10 bins: 35 of the 60 teams. Most scores are between 0 and 500.
  • A few teams have scores above 1500.
  • No team has a score between 3500 and 4000.
  • Very few teams scored more than 4000. We can consider them outliers.
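The Mean > Median > Mode ordering for a right-skewed distribution can be checked on a small hypothetical sample (not the basketball data):

```python
import pandas as pd

# Hypothetical right-skewed sample: mode at the low end, a long right tail
s = pd.Series([2, 2, 2, 3, 4, 5, 9, 30])

mean, median, mode = s.mean(), s.median(), s.mode()[0]
print(mean, median, mode)  # → 7.125 3.5 2

# For a right-skewed distribution this ordering holds
assert mean > median > mode
```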

Explore the 'TeamLaunch' Variable¶

In [62]:
#Flexibly plot a univariate distribution of observations.

sns.distplot(basketball.TeamLaunch)
Out[62]:
<AxesSubplot:xlabel='TeamLaunch', ylabel='Density'>
[Figure: distribution plot of TeamLaunch]
In [63]:
# We can categorize the variable TeamLaunch based on quantiles

# Quantile-based discretization function.

# Discretize variable into equal-sized buckets based on rank or based
# on sample quantiles. For example 1000 values for 10 quantiles would
# produce a Categorical object indicating quantile membership for each data point.

basketball['TeamLaunchCategory'] = (pd.qcut(basketball['TeamLaunch'], 4, 
                                              labels=['Very Old', 'Old', 'New', 'Very New']));
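`pd.qcut` assigns each value to one of four equal-count buckets by rank. A quick illustration on hypothetical launch years (not the actual TeamLaunch column):

```python
import pandas as pd

# Hypothetical launch years; qcut splits them into four equal-count buckets by rank
years = pd.Series([1929, 1935, 1950, 1960, 1975, 1990, 2005, 2016])
cats = pd.qcut(years, 4, labels=['Very Old', 'Old', 'New', 'Very New'])

# Each bucket receives exactly two of the eight values
print(cats.tolist())
```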
In [64]:
# Show the counts of observations in each categorical bin using bars.

# A count plot can be thought of as a histogram across a categorical, instead
# of quantitative, variable. The basic API and options are identical to those
# for :func:`barplot`, so you can compare counts across nested variables.

plt.figure(figsize=(25,8))
sns.countplot(x='TeamLaunch',hue='TeamLaunchCategory',data = basketball)
plt.title("Team Launch count along with Category",size=15)
plt.xticks(rotation=90)
plt.show()
[Figure: TeamLaunch counts coloured by TeamLaunchCategory]

Step-3: Multivariate Analysis¶

Multivariate analysis is performed to understand interactions between different fields in the dataset, i.e. interactions among more than two variables.

Examples: pair plot, 3D scatter plot, etc.

Covariance¶

In [65]:
# Prints the covariance of each attribute against every other attribute

basketball.cov()
Out[65]:
Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld
Tournament 722.782768 3.003033e+04 2.355925e+04 10671.614124 5372.170904 7516.715537 3.962956e+04 3.090991e+04 87.383898 79.441243 -429.234746 -101.172034
Score 30030.333051 1.295825e+06 9.785666e+05 463704.696610 220292.385593 294612.626271 1.715454e+06 1.250509e+06 4489.268644 3982.835593 -16335.672034 -4049.124576
PlayedGames 23559.249153 9.785666e+05 7.699455e+05 346774.216949 176166.940678 247031.154237 1.286941e+06 1.011131e+06 2783.174576 2541.345763 -13590.384746 -3316.361017
WonGames 10671.614124 4.637047e+05 3.467742e+05 166857.049718 77508.658192 102428.853672 6.172252e+05 4.401460e+05 1696.289831 1490.917514 -5882.320339 -1406.527119
DrawnGames 5372.170904 2.202924e+05 1.761669e+05 77508.658192 40798.145480 57867.608757 2.877725e+05 2.333322e+05 560.792373 523.014124 -3068.826271 -773.258475
LostGames 7516.715537 2.946126e+05 2.470312e+05 102428.853672 57867.608757 86735.338701 3.820110e+05 3.376645e+05 526.338136 527.657062 -4642.466949 -1136.736441
BasketScored 39629.558475 1.715454e+06 1.286941e+06 617225.191525 287772.461864 382010.997458 2.286337e+06 1.638664e+06 6203.568644 5471.794915 -22166.405932 -5213.356780
BasketGiven 30909.909040 1.250509e+06 1.011131e+06 440145.958192 233332.166667 337664.535028 1.638664e+06 1.354772e+06 3027.689831 2840.439548 -18885.327119 -4499.689831
TournamentChampion 87.383898 4.489269e+03 2.783175e+03 1696.289831 560.792373 526.338136 6.203569e+03 3.027690e+03 30.421186 24.513559 -42.146610 -8.921186
RunnerUp 79.441243 3.982836e+03 2.541346e+03 1490.917514 523.014124 527.657062 5.471795e+03 2.840440e+03 24.513559 20.927684 -38.672881 -8.733898
TeamLaunch -429.234746 -1.633567e+04 -1.359038e+04 -5882.320339 -3068.826271 -4642.466949 -2.216641e+04 -1.888533e+04 -42.146610 -38.672881 710.048305 84.934746
HighestPositionHeld -101.172034 -4.049125e+03 -3.316361e+03 -1406.527119 -773.258475 -1136.736441 -5.213357e+03 -4.499690e+03 -8.921186 -8.733898 84.934746 28.251695

Correlation¶

In [66]:
# Prints the correlation coefficient between every pair of attributes

basketball.corr()
Out[66]:
Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld
Tournament 1.000000 0.981258 0.998683 0.971749 0.989295 0.949350 0.974867 0.987781 0.589304 0.645924 -0.599166 -0.708002
Score 0.981258 1.000000 0.979687 0.997232 0.958090 0.878780 0.996634 0.943801 0.715014 0.764819 -0.538542 -0.669215
PlayedGames 0.998683 0.979687 1.000000 0.967486 0.993972 0.955925 0.969970 0.990020 0.575072 0.633101 -0.581243 -0.711065
WonGames 0.971749 0.997232 0.967486 1.000000 0.939416 0.851436 0.999312 0.925745 0.752904 0.797849 -0.540421 -0.647819
DrawnGames 0.989295 0.958090 0.993972 0.939416 1.000000 0.972786 0.942234 0.992479 0.503378 0.566021 -0.570175 -0.720248
LostGames 0.949350 0.878780 0.955925 0.851436 0.972786 1.000000 0.857843 0.985041 0.324025 0.391645 -0.591571 -0.726172
BasketScored 0.974867 0.996634 0.969970 0.999312 0.942234 0.857843 1.000000 0.931079 0.743847 0.791041 -0.550150 -0.648672
BasketGiven 0.987781 0.943801 0.990020 0.925745 0.992479 0.985041 0.931079 1.000000 0.471618 0.533448 -0.608902 -0.727323
TournamentChampion 0.589304 0.715014 0.575072 0.752904 0.503378 0.324025 0.743847 0.471618 1.000000 0.971533 -0.286768 -0.304307
RunnerUp 0.645924 0.764819 0.633101 0.797849 0.566021 0.391645 0.791041 0.533448 0.971533 1.000000 -0.317250 -0.359191
TeamLaunch -0.599166 -0.538542 -0.581243 -0.540421 -0.570175 -0.591571 -0.550150 -0.608902 -0.286768 -0.317250 1.000000 0.599679
HighestPositionHeld -0.708002 -0.669215 -0.711065 -0.647819 -0.720248 -0.726172 -0.648672 -0.727323 -0.304307 -0.359191 0.599679 1.000000
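Each correlation above is just the corresponding covariance scaled by the two standard deviations, r_xy = cov(x, y) / (σ_x · σ_y). A quick verification on toy columns (hypothetical data):

```python
import numpy as np
import pandas as pd

# Toy columns (hypothetical) to verify r_xy = cov(x, y) / (std_x * std_y)
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 6]})

cov_xy = df['x'].cov(df['y'])
r_manual = cov_xy / (df['x'].std() * df['y'].std())
r_pandas = df['x'].corr(df['y'])

# The manual computation matches pandas' corr()
assert np.isclose(r_manual, r_pandas)
print(r_pandas)
```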

Heatmap¶

In [67]:
# Plot rectangular data as a color-encoded matrix.

# This is an Axes-level function and will draw the heatmap into the
# currently-active Axes if none is provided to the ``ax`` argument.  Part of
# this Axes space will be taken and used to plot a colormap, unless ``cbar``
# is False or a separate Axes is provided to ``cbar_ax``.

plt.figure(figsize=(10,5))
corrmat = basketball.corr(method='pearson')
sns.heatmap(corrmat, cmap="YlGnBu", fmt='.2f',annot=True)
plt.show();
[Figure: correlation heatmap of all numerical variables]

Scatterplot - All Variables¶

In [68]:
# In the following plot scatter diagrams are plotted for all the numerical columns in the dataset. 
# A scatter plot is a visual representation of the degree of correlation between any two columns. 
# The pair plot function in seaborn makes it very easy to generate joint scatter plots for all the 
# columns in the data.

plt.figure(figsize=(10,5))
pairplot=sns.pairplot(basketball);
plt.show()
<Figure size 720x360 with 0 Axes>
[Figure: pair plot of all numerical variables]

Scatterplot - Selected Variables¶

In [69]:
# Plots the scatter plot using two variables

sns.scatterplot(data=basketball, x="Tournament", y="Score", size="Score")
Out[69]:
<AxesSubplot:xlabel='Tournament', ylabel='Score'>
[Figure: scatter plot of Tournament vs Score, sized by Score]
In [70]:
# Another way of looking at multivariate scatter plot is to use the hue option
# in the scatterplot() function in seaborn.

# basketball['TeamLaunchCategory'] = (pd.qcut(basketball['TeamLaunch'], 4, 
#                                     labels=['Very Old', 'Old', 'New', 'Very New']));

sns.scatterplot(data=basketball, x="Tournament", y="Score", hue="TeamLaunchCategory")
Out[70]:
<AxesSubplot:xlabel='Tournament', ylabel='Score'>
[Figure: scatter plot of Tournament vs Score, coloured by TeamLaunchCategory]

Important Insights:

  • In summary, we have a highly correlated dataset.
  • Score, WonGames, and BasketScored correlate at nearly 1, an almost perfect positive correlation.
  • PlayedGames and Tournament also correlate at nearly 1, another almost perfect positive correlation.
  • Apart from TeamLaunch and HighestPositionHeld, all fields are positively correlated with each other.
  • TeamLaunch is negatively correlated with the cumulative statistics: older teams have accumulated higher values than newer teams.
  • PlayedGames and DrawnGames correlate at 0.99: drawn games scale with games played more closely than WonGames or LostGames do.
  • TournamentChampion and RunnerUp are counts; their correlations with the other attributes range from roughly 0.3 to 0.8.

Explore Tournament vs TournamentChampion vs RunnerUp¶

In [71]:
SortedDf=basketball.sort_values('Tournament', ascending=False);
 
plt.figure(figsize=(40,10))

# set height of bar
bars1 = SortedDf.Tournament
bars2 = SortedDf.TournamentChampion
bars3 = SortedDf.RunnerUp

 # set width of bar
barWidth = 0.3

# Set position of the bars on the X axis
r1 = np.arange(len(bars1)) + barWidth
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]

# Make the plot
plt.bar(r1, bars1, color='b', width=barWidth, edgecolor='white', label='Tournament')
plt.bar(r2, bars2, color='g', width=barWidth, edgecolor='white', label='TournamentChampion')
plt.bar(r3, bars3, color='y', width=barWidth, edgecolor='white', label='RunnerUp')

# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + (barWidth) for r in range(len(bars1))], SortedDf.Team)
plt.title("Tournament vs TournamentChampion vs Runnerup ",size=35);

# Create legend & Show graphic
plt.legend()
plt.show()
[Figure: grouped bar chart of Tournament, TournamentChampion, and RunnerUp by team]

Explore PlayedGames vs WonGames Vs DrawnGames vs LostGames¶

In [72]:
SortedDf=basketball.sort_values('PlayedGames',ascending=False);
  
plt.figure(figsize=(40,10))

# set height of bar
bars1 = SortedDf.PlayedGames
bars2 = SortedDf.WonGames
bars3 = SortedDf.DrawnGames
bars4 = SortedDf.LostGames

 # set width of bar
barWidth =  0.3

# Set position of bar on X axis
r1 = np.arange(len(bars1))+barWidth
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
r4 = [x + barWidth for x in r3]

# Make the plot
plt.bar(r1, bars1, color='b', width=barWidth, edgecolor='white', label='PlayedGames')
plt.bar(r2, bars2, color='g', width=barWidth, edgecolor='white', label='WonGames')
plt.bar(r3, bars3, color='y', width=barWidth, edgecolor='white', label='DrawnGames')
plt.bar(r4, bars4, color='m', width=barWidth, edgecolor='white', label='LostGames')

# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + (3*barWidth) for r in range(len(bars1))], SortedDf.Team)
plt.title("PlayedGames vs WonGames Vs DrawnGames vs LostGames",size=35)

# Create legend & Show graphic
plt.legend()
plt.show()
[Figure: grouped bar chart of PlayedGames, WonGames, DrawnGames, and LostGames by team]

Explore BasketScored vs BasketGiven¶

In [73]:
SortedDf=basketball.sort_values('BasketScored',ascending=False);
  
plt.figure(figsize=(40,10))

# set height of bar
bars1 = SortedDf.BasketScored
bars2 = SortedDf.BasketGiven

 # set width of bar
barWidth =  0.3

# Set position of the bars on the X axis
r1 = np.arange(len(bars1)) + barWidth
r2 = [x + barWidth for x in r1]

# Make the plot
plt.bar(r1, bars1, color='g', width=barWidth, edgecolor='white', label='BasketScored')
plt.bar(r2, bars2, color='r', width=barWidth, edgecolor='white', label='BasketGiven')

# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + (barWidth) for r in range(len(bars1))], SortedDf.Team)
plt.title("BasketScored vs BasketGiven",size=35)

# Create legend & Show graphic
plt.legend()
plt.show()
[Figure: grouped bar chart of BasketScored vs BasketGiven by team]

Explore some other Charts¶

In [74]:
# No of matches won/lost by teams

plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
bb = basketball[['Team','WonGames']].sort_values(by="WonGames", ascending=False)
plot = plt.pie(bb['WonGames'], labels=list(bb['Team'][:15]) + [str()] * (len(bb)-15), explode=[0.015*x for x in range(len(bb))])
plt.title('# matches won')
plt.subplot(1,2,2)
bb = basketball[['Team','LostGames']].sort_values(by="LostGames", ascending=False)
plot = plt.pie(bb['LostGames'], labels=list(bb['Team'][:20]) + [str()] * (len(bb)-20), explode=[0.015*x for x in range(len(bb))])
plt.title('# matches lost')
Out[74]:
Text(0.5, 1.0, '# matches lost')
[Figure: pie charts of matches won and matches lost by team]
In [75]:
# No of baskets scored/given by teams

plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
bb = basketball[['Team','BasketScored']].sort_values(by="BasketScored", ascending=False)
plot = plt.pie(bb['BasketScored'], labels=list(bb['Team'][:15]) + [str()] * (len(bb)-15))
plt.title('# Baskets scored')
plt.subplot(1,2,2)
bb = basketball[['Team','BasketGiven']].sort_values(by="BasketGiven", ascending=False)
plot = plt.pie(bb['BasketGiven'], labels=list(bb['Team'][:20]) + [str()] * (len(bb)-20))
plt.title('# Baskets given')
Out[75]:
Text(0.5, 1.0, '# Baskets given')
[Figure: pie charts of baskets scored and baskets given by team]
In [76]:
# The most RunnerUp
plt.figure(figsize=(20,8))
bb = basketball[['Team','RunnerUp']].sort_values(by="RunnerUp", ascending=False).where(lambda x: x["RunnerUp"] > 0)
sns.barplot(x=bb['Team'], y=bb['RunnerUp'])
plt.xticks(rotation=90)
Out[76]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]),
 [Text(0, 0, 'T2'),
  Text(1, 0, 'T1'),
  Text(2, 0, 'T3'),
  Text(3, 0, 'T5'),
  Text(4, 0, 'T4'),
  Text(5, 0, 'T11'),
  Text(6, 0, 'T6'),
  Text(7, 0, 'T8'),
  Text(8, 0, 'T9'),
  Text(9, 0, 'T21'),
  Text(10, 0, 'T20'),
  Text(11, 0, 'T14'),
  Text(12, 0, 'T15')])
[Figure: bar chart of RunnerUp counts by team]

Important Insights:

  • The teams with the most active participation are T1 through T8; T7's performance was quite low in comparison to the other teams.
  • Many teams participated in tournaments but could not secure one of the first two positions.
  • The top 3 high-performing teams are T1, T2, and T5.
  • T11 participated in fewer tournaments but showed good results compared to many other teams.
  • T20 and T21 participated in fewer tournaments but finished second in some of them.
  • T1, T2, and T5 played the most tournament matches, and T1 and T2 are the highest average scorers.
  • T3 played fewer games but kept its position on the leaderboard in comparison to T4 and T5.
  • T1 seems to be a team with mature experience and game spirit.
  • T2 also exhibited fine and profound game skills, appearing most often in the champion and runner-up rankings.

Step-4: Bivariate Analysis¶

Through bivariate analysis we try to analyze two variables simultaneously. As opposed to univariate analysis where we check the characteristics of a single variable, in bivariate analysis we try to determine if there is any relationship between two variables.

There are essentially 3 major scenarios that we will come across when we perform bivariate analysis:

  1. Both variables of interest are qualitative
  2. One variable is qualitative and the other is quantitative
  3. Both variables are quantitative

Explore Team vs PlayedGames with TeamLaunch¶

In [77]:
SortedDf=basketball.sort_values('PlayedGames',ascending=False);
fig, ax = plt.subplots(figsize=(40,10))   # setting the figure size of the plot
ax.scatter(SortedDf['Team'], SortedDf['PlayedGames'])  # scatter plot
ax.set_xlabel('Team ', fontsize=20)
ax.set_ylabel('Played Games', fontsize=20)

team=np.array(SortedDf.Team);
basket=np.array(SortedDf.PlayedGames);
for i, txt in enumerate(SortedDf.TeamLaunch):
    plt.annotate(txt, (team[i], basket[i]))
plt.title("Team vs PlayedGames annotated with TeamLaunch",size=35);
plt.show()
[Figure: Team vs PlayedGames scatter annotated with TeamLaunch]

Explore PlayedGames across TeamLaunchCategory¶

In [78]:
# basketball['TeamLaunchCategory'] = (pd.qcut(basketball['TeamLaunch'], 4, 
#                                     labels=['Very Old', 'Old', 'New', 'Very New']));

GamesPlayed=basketball.sort_values('PlayedGames',ascending=False).groupby('TeamLaunchCategory')['PlayedGames'].sum().reset_index();

plt.figure(figsize=(15,5))
sns.barplot(x ='TeamLaunchCategory', y='PlayedGames' ,data = GamesPlayed)
plt.title("PlayedGames across Team launch category",size=15)
plt.show()
[Figure: total PlayedGames by TeamLaunchCategory]

Explore Tournament vs Team¶

In [79]:
plt.figure(figsize=(30,10))
ax = sns.barplot(x="Team", y="Tournament", data=basketball)
[Figure: bar chart of Tournament by Team]

Explore Score Vs Teams¶

In [80]:
# Scores of the teams
plt.figure(figsize=(20,15))
ax = sns.barplot(x="Score", y="Team", data=basketball, orient='h')
[Figure: horizontal bar chart of Score by Team]

Explore Score Vs Teams with TeamLaunchCategory¶

In [81]:
plt.figure(figsize=(40,10))
sns.barplot(data=basketball,x='Team',y='Score',hue='TeamLaunchCategory')
plt.show()
[Figure: Score by Team, grouped by TeamLaunchCategory]

Step-5: Performance Matrix¶

Correlation Matrix¶

In [82]:
#Plot correlation matrix using pandas, and select values in upper triangle
#https://stackoverflow.com/questions/29432629/plot-correlation-matrix-using-pandas

CorMatrix = basketball.corr().abs()
Upper = CorMatrix.where(np.triu(np.ones(CorMatrix.shape), k=1).astype(bool))

Upper
Out[82]:
Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld
Tournament NaN 0.981258 0.998683 0.971749 0.989295 0.949350 0.974867 0.987781 0.589304 0.645924 0.599166 0.708002
Score NaN NaN 0.979687 0.997232 0.958090 0.878780 0.996634 0.943801 0.715014 0.764819 0.538542 0.669215
PlayedGames NaN NaN NaN 0.967486 0.993972 0.955925 0.969970 0.990020 0.575072 0.633101 0.581243 0.711065
WonGames NaN NaN NaN NaN 0.939416 0.851436 0.999312 0.925745 0.752904 0.797849 0.540421 0.647819
DrawnGames NaN NaN NaN NaN NaN 0.972786 0.942234 0.992479 0.503378 0.566021 0.570175 0.720248
LostGames NaN NaN NaN NaN NaN NaN 0.857843 0.985041 0.324025 0.391645 0.591571 0.726172
BasketScored NaN NaN NaN NaN NaN NaN NaN 0.931079 0.743847 0.791041 0.550150 0.648672
BasketGiven NaN NaN NaN NaN NaN NaN NaN NaN 0.471618 0.533448 0.608902 0.727323
TournamentChampion NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.971533 0.286768 0.304307
RunnerUp NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.317250 0.359191
TeamLaunch NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.599679
HighestPositionHeld NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
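The upper-triangle mask built with `np.triu` keeps each pair of attributes once, so it can be used to list highly correlated columns. A sketch on a toy frame (hypothetical data, with an arbitrary 0.95 threshold):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 'b' is an exact multiple of 'a'
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 2, 2]})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Columns whose correlation with an earlier column exceeds the (arbitrary) 0.95 threshold
high = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(high)  # → ['b']
```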
In [83]:
PerformanceMatrix=basketball[['Team','TeamLaunch','TeamLaunchCategory','PlayedGames','HighestPositionHeld']].copy()

PerformanceMatrix['Win']=round((basketball.WonGames/ basketball.PlayedGames)*100,2)
PerformanceMatrix['Drawn']=round((basketball.DrawnGames/ basketball.PlayedGames)*100,2)
PerformanceMatrix['Lost']=round((basketball.LostGames/ basketball.PlayedGames)*100,2)
PerformanceMatrix['TChampionship']=round((basketball.TournamentChampion/ basketball.Tournament)*100,2)
PerformanceMatrix['TRunnerUp']=round((basketball.RunnerUp/ basketball.Tournament)*100,2)

PerformanceMatrix.describe()
Out[83]:
TeamLaunch PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
count 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000 60.000000
mean 1957.950000 810.100000 7.050000 31.364667 23.617167 44.992167 1.749500 1.956333
std 26.646732 877.465393 5.315232 7.831421 4.459405 8.401111 6.442478 5.491846
min 1929.000000 30.000000 1.000000 16.670000 11.110000 20.380000 0.000000 0.000000
25% 1934.750000 115.500000 3.000000 27.610000 22.330000 41.147500 0.000000 0.000000
50% 1950.500000 424.500000 6.000000 30.490000 24.370000 45.725000 0.000000 0.000000
75% 1977.250000 1345.500000 10.000000 33.542500 26.212500 48.542500 0.000000 0.000000
max 2016.000000 2762.000000 20.000000 59.630000 38.600000 70.000000 38.370000 29.070000
In [84]:
fig, ax = plt.subplots(1, 3)

fig.set_figheight(5)
fig.set_figwidth(15)

sns.distplot(PerformanceMatrix['Win'], ax = ax[0])
sns.distplot(PerformanceMatrix['Lost'], ax = ax[1])
sns.distplot(PerformanceMatrix['Drawn'], ax = ax[2])

ax[0].set_title('Win %')
ax[1].set_title('Lost %')
ax[2].set_title('Drawn %')
plt.show()
[Figure: distribution plots of Win %, Lost %, and Drawn %]
In [85]:
fig, ax = plt.subplots(2, 3)

fig.set_figheight(5)
fig.set_figwidth(15)

sns.boxplot(x=PerformanceMatrix['Win'],ax=ax[0][0]);
sns.boxplot(x=PerformanceMatrix['Lost'],ax=ax[0][1]);
sns.boxplot(x=PerformanceMatrix['Drawn'],ax=ax[0][2]);
sns.boxplot(x=PerformanceMatrix['TChampionship'],ax=ax[1][0]);
sns.boxplot(x=PerformanceMatrix['TRunnerUp'],ax=ax[1][1]);
sns.boxplot(x=PerformanceMatrix['TeamLaunch'],ax=ax[1][2]);


fig.tight_layout() 
plt.show()
[Figure: box plots of Win, Lost, Drawn, TChampionship, TRunnerUp, and TeamLaunch]

Finding Outliers using Z-Score¶

In [86]:
# Get the z-score of every value with respect to their columns

PmSelect = PerformanceMatrix[['Win','Drawn','Lost']];
z = np.abs(stats.zscore(PmSelect))

threshold = 3
np.where(z > threshold)
Out[86]:
(array([ 0,  1, 45, 59], dtype=int64), array([0, 0, 1, 2], dtype=int64))
In [87]:
PerformanceMatrix.iloc[[0,1,45,59]]
Out[87]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
0 T1 1929 Very Old 2762 1 59.63 19.99 20.38 38.37 26.74
1 T2 1929 Very Old 2762 1 57.24 20.75 22.01 29.07 29.07
45 T46 1990 Very New 114 9 22.81 38.60 38.60 0.00 0.00
59 T60 1955 New 30 15 16.67 13.33 70.00 0.00 0.00

Important Insights:

  • Outliers due to high Win % and low Lost %: T1 and T2
  • Outliers due to low Win % and high Lost %: T46 and T60
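The z-score used above is z = (x − mean) / std. A minimal manual check against `scipy.stats.zscore` on toy data (hypothetical values; a lower threshold of 2 is used here because the sample is tiny):

```python
import numpy as np
from scipy import stats

# Toy sample (hypothetical) with one extreme value at the end
x = np.array([10, 11, 12, 10, 11, 12, 10, 11, 12, 100], dtype=float)

z_manual = (x - x.mean()) / x.std()   # population std (ddof=0), scipy's default
z_scipy = stats.zscore(x)
assert np.allclose(z_manual, z_scipy)

# Flag the extreme value; threshold 2 suits this tiny sample
print(np.where(np.abs(z_scipy) > 2)[0])  # → [9]
```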

Step-6: Recommended Teams for Company X based on all the above Analysis¶

Explore Win% Vs Drawn% Vs Lost%¶

In [88]:
SortedDf=PerformanceMatrix.sort_values('Win',ascending=False);
 
plt.figure(figsize=(35,10));
# set height of bar
bars1 = SortedDf.Win
bars2 = SortedDf.Drawn
bars3 = SortedDf.Lost
 # set width of bar
barWidth =  0.2
# Set position of bar on X axis
r1 = np.arange(len(bars1))+barWidth
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
# Make the plot
plt.bar(r1, bars1, color='g', width=barWidth, edgecolor='white', label='Win%')
plt.bar(r2, bars2, color='y', width=barWidth, edgecolor='white', label='Drawn%')
plt.bar(r3, bars3, color='r', width=barWidth, edgecolor='white', label='Lost%')

# Add xticks on the middle of the group bars
plt.xlabel('Teams', fontweight='bold')
plt.xticks([r + (3*barWidth) for r in range(len(bars1))], SortedDf.Team)
plt.title("Win% Vs Drawn% Vs Lost%",size=35)
    
# Create legend & Show graphic
plt.legend()
plt.show()
[Figure: Win % vs Drawn % vs Lost % by team]

Explore PlayedGames vs Teams vs Team Launch¶

In [89]:
SortedDf=(PerformanceMatrix.sort_values(['TeamLaunch'],ascending=[True]))

plt.figure(figsize=(30,5))
sns.pointplot(x='Team',y='PlayedGames',hue='TeamLaunchCategory' ,data=SortedDf);
plt.title("Played games across teams in the order of team launch ")
plt.show()
[Figure: PlayedGames across teams in order of team launch]

Performance Report of Teams in Playing Games¶

In [90]:
SortedDf=(PerformanceMatrix.sort_values(['TeamLaunch'],ascending=[True]))
fig, ax =plt.subplots(6,figsize=(30, 30), sharey=True)

sns.pointplot(x='Team',y='Win',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[0]);
sns.pointplot(x='Team',y='Drawn',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[1]);
sns.pointplot(x='Team',y='Lost',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[2]);
sns.pointplot(x='Team',y='TChampionship',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[3]);
sns.pointplot(x='Team',y='TRunnerUp',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[4]);
sns.pointplot(x='Team',y='HighestPositionHeld',hue='TeamLaunchCategory' ,data=SortedDf,ax=ax[5]);


fig.tight_layout()
plt.subplots_adjust(hspace=0.3)
ax[0].set_title("Performance report of teams in playing games\n Win%",size='30')
ax[1].set_title("Drawn %",size='30')
ax[2].set_title("Lost %",size='30')
ax[3].set_title("Tournament Championship %",size='30')
ax[4].set_title("Tournament RunnerUp %",size='30')
ax[5].set_title("HighestPositionHeld",size='30')
Out[90]:
Text(0.5, 1.0, 'HighestPositionHeld')
No description has been provided for this image

Analyze Team Launch Categories¶

In [91]:
GroupLostMean=PerformanceMatrix.groupby(['TeamLaunchCategory'])['Lost'].mean().reset_index();
GroupDrawnMean=PerformanceMatrix.groupby(['TeamLaunchCategory'])['Drawn'].mean().reset_index();
GroupWinMean=PerformanceMatrix.groupby(['TeamLaunchCategory'])['Win'].mean().reset_index();

fig, ax =plt.subplots((3),figsize=(10, 10), sharey=True)

sns.pointplot(x=GroupWinMean.TeamLaunchCategory,y=GroupWinMean.Win,data=GroupWinMean,ax=ax[0]);
sns.pointplot(x=GroupDrawnMean.TeamLaunchCategory,y=GroupDrawnMean.Drawn,data=GroupDrawnMean,ax=ax[1]);
sns.pointplot(x=GroupLostMean.TeamLaunchCategory,y=GroupLostMean.Lost,data=GroupLostMean,ax=ax[2]);

fig.tight_layout()
plt.subplots_adjust(hspace=0.4)
ax[0].set_title("Analyze Team launch categories \n Mean Win%",size='20')
ax[1].set_title("Mean Drawn %",size='20')
ax[2].set_title("Mean Lost %",size='20')
plt.show()
No description has been provided for this image

Important Insights:

  • Avg. Drawn % is high for very new teams compared to old teams; with more practice this drawn % can be converted into more wins.
  • Avg. Lost % is low for very new teams compared to other teams; very new teams perform better than old/new teams.
  • Avg. Win % is high for very old teams and gradually decreases for newer teams.
  • More detailed information can be inferred from the "Team vs PlayedGames Annotated with TeamLaunch" and other charts.

Top 10 teams in the given list with Highest Win %¶

In [92]:
PerformanceMatrix.sort_values(['Win'],ascending=[False]).head(10)
Out[92]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
0 T1 1929 Very Old 2762 1 59.63 19.99 20.38 38.37 26.74
1 T2 1929 Very Old 2762 1 57.24 20.75 22.01 29.07 29.07
2 T3 1929 Very Old 2614 1 47.48 22.88 29.65 12.50 10.00
3 T4 1931 Very Old 2664 1 44.56 23.12 32.32 7.32 7.32
4 T5 1929 Very Old 2762 1 43.77 22.92 33.31 9.30 8.14
20 T21 1998 Very New 646 2 41.18 26.63 32.20 0.00 5.88
5 T6 1934 Very Old 2408 1 41.11 22.05 36.84 1.37 5.48
7 T8 1929 Very Old 2302 1 37.53 25.07 37.40 2.86 4.29
10 T11 1941 Old 1530 1 36.80 25.62 37.58 2.22 11.11
6 T7 1929 Very Old 2626 3 36.10 23.15 40.75 0.00 0.00

Top 10 winning teams excluding very old teams¶

In [93]:
PerformanceMatrix[PerformanceMatrix.TeamLaunchCategory != 'Very Old'].sort_values(['Win'],ascending=[False]).head(10)
Out[93]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
20 T21 1998 Very New 646 2 41.18 26.63 32.20 0.00 5.88
10 T11 1941 Old 1530 1 36.80 25.62 37.58 2.22 11.11
8 T9 1939 Old 1986 2 35.15 26.28 38.57 0.00 1.72
11 T12 1939 Old 1698 4 34.51 22.91 42.58 0.00 0.00
18 T19 1960 New 988 3 33.70 25.91 40.38 0.00 0.00
19 T20 1951 New 1096 2 33.49 22.08 44.43 0.00 3.03
38 T39 1994 Very New 160 10 32.50 28.12 39.38 0.00 0.00
15 T16 1935 Old 1318 4 32.32 24.81 42.87 0.00 0.00
14 T15 1944 Old 1458 2 32.30 24.55 43.14 0.00 2.33
24 T25 2004 Very New 456 6 32.24 24.56 43.20 0.00 0.00

Top Teams with High Performance¶

In [94]:
basketball[(basketball.PlayedGames==basketball.PlayedGames.max())]
Out[94]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategory
0 T1 86 4385 2762 1647 552 563 5947 3140 33 23 1929 1 Very Old
1 T2 86 4262 2762 1581 573 608 5900 3114 25 25 1929 1 Very Old
4 T5 86 3368 2762 1209 633 920 4631 3700 8 7 1929 1 Very Old

Best Performance Team in the Dataset¶

In [95]:
PerformanceMatrix[(PerformanceMatrix.Win==PerformanceMatrix.Win.max()) & (PerformanceMatrix.Lost==PerformanceMatrix.Lost.min())]
Out[95]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
0 T1 1929 Very Old 2762 1 59.63 19.99 20.38 38.37 26.74
In [96]:
# Performance order of teams is T1>T2>T5
# The above teams are outliers, so we look at the team with the next-highest PlayedGames

basketball[(basketball.PlayedGames==basketball.PlayedGames.nlargest(4)[3])]
Out[96]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategory
3 T4 82 3386 2664 1187 616 861 4398 3469 6 6 1931 1 Very Old

Teams with Low Performance¶

In [97]:
(basketball.sort_values(['PlayedGames'],ascending=[True])).head(3)
Out[97]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategory
59 T60 1 14 30 5 4 21 34 65 0 0 1955 15 New
57 T58 1 22 30 7 8 15 37 57 0 0 1956 16 New
58 T59 1 19 30 7 5 18 51 85 0 0 1951 16 New

Teams with high rank in position¶

In [98]:
(basketball.sort_values(['HighestPositionHeld'],ascending=[False])).head(1)
Out[98]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategory
56 T57 1 34 38 8 10 20 38 66 0 0 2009 20 Very New

Old Teams with less performance¶

In [99]:
sorted=PerformanceMatrix.loc[PerformanceMatrix['TeamLaunchCategory']=='Very Old'].sort_values(['Win'],ascending=[True])
sorted.head(5)
Out[99]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
51 T52 1929 Very Old 72 6 29.17 19.44 51.39 0.0 0.00
13 T14 1929 Very Old 1428 2 31.72 23.53 44.75 0.0 2.27
28 T29 1930 Very Old 380 6 32.89 21.32 45.79 0.0 0.00
44 T45 1929 Very Old 130 3 33.08 16.15 50.77 0.0 0.00
53 T54 1929 Very Old 54 8 33.33 11.11 55.56 0.0 0.00

Team with Most Drawn games¶

In [100]:
PerformanceMatrix.sort_values(['Drawn'],ascending=[False]).head(1)
Out[100]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
45 T46 1990 Very New 114 9 22.81 38.6 38.6 0.0 0.0

Old teams with low targets¶

In [101]:
sorted=(basketball.sort_values(['PlayedGames'],ascending=[True]))

sorted[(sorted.TeamLaunch==sorted.TeamLaunch.min())].head(2)
Out[101]:
Team Tournament Score PlayedGames WonGames DrawnGames LostGames BasketScored BasketGiven TournamentChampion RunnerUp TeamLaunch HighestPositionHeld TeamLaunchCategory
53 T54 3 42 54 18 6 30 97 131 0 0 1929 8 Very Old
51 T52 4 56 72 21 14 37 153 184 0 0 1929 6 Very Old

Recommended Teams for company X based on all the above analysis¶

In [102]:
TeamsFilters=['T21','T39','T46','T19','T20','T11','T9','T5','T3','T4','T6','T14','T7','T8','T10','T18']

BetterTeams=pd.DataFrame(columns=PerformanceMatrix.columns);

for i in TeamsFilters:
       if i in PerformanceMatrix.Team.values:
            BetterTeams = pd.concat([BetterTeams,PerformanceMatrix[PerformanceMatrix.Team==i]], ignore_index=True)
                 
BetterTeams.sort_values('Win',ascending=False)
Out[102]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
8 T3 1929 Very Old 2614 1 47.48 22.88 29.65 12.50 10.00
9 T4 1931 Very Old 2664 1 44.56 23.12 32.32 7.32 7.32
7 T5 1929 Very Old 2762 1 43.77 22.92 33.31 9.30 8.14
0 T21 1998 Very New 646 2 41.18 26.63 32.20 0.00 5.88
10 T6 1934 Very Old 2408 1 41.11 22.05 36.84 1.37 5.48
13 T8 1929 Very Old 2302 1 37.53 25.07 37.40 2.86 4.29
5 T11 1941 Old 1530 1 36.80 25.62 37.58 2.22 11.11
12 T7 1929 Very Old 2626 3 36.10 23.15 40.75 0.00 0.00
6 T9 1939 Old 1986 2 35.15 26.28 38.57 0.00 1.72
14 T10 1932 Very Old 1728 1 35.07 25.46 39.47 1.96 0.00
15 T18 1933 Very Old 1192 3 34.23 24.50 41.28 0.00 0.00
3 T19 1960 New 988 3 33.70 25.91 40.38 0.00 0.00
4 T20 1951 New 1096 2 33.49 22.08 44.43 0.00 3.03
1 T39 1994 Very New 160 10 32.50 28.12 39.38 0.00 0.00
11 T14 1929 Very Old 1428 2 31.72 23.53 44.75 0.00 2.27
2 T46 1990 Very New 114 9 22.81 38.60 38.60 0.00 0.00
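The selection loop above can be expressed more idiomatically with `Series.isin`, which silently skips any team absent from the frame, so the explicit membership check is unnecessary. A minimal sketch on a toy frame shaped like `PerformanceMatrix` (values are illustrative only; row order after sorting matches the loop version):

```python
import pandas as pd

# Toy frame mimicking PerformanceMatrix (illustrative values only)
PerformanceMatrix = pd.DataFrame({
    "Team": ["T1", "T3", "T21", "T99"],
    "Win": [59.63, 47.48, 41.18, 10.00],
})

TeamsFilters = ["T21", "T3", "T46"]  # T46 is deliberately absent from the frame

# isin keeps only the listed teams; absent teams simply do not match
BetterTeams = PerformanceMatrix[PerformanceMatrix.Team.isin(TeamsFilters)]
BetterTeams = BetterTeams.sort_values("Win", ascending=False).reset_index(drop=True)
print(BetterTeams.Team.tolist())  # ['T3', 'T21']
```

This avoids repeated `pd.concat` calls inside a loop, which copies the accumulated frame on every iteration.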

Recommended Teams for company X based on all the above analysis (Excluding the Very Old Teams)¶

In [103]:
BetterTeams[BetterTeams.TeamLaunchCategory!='Very Old']
Out[103]:
Team TeamLaunch TeamLaunchCategory PlayedGames HighestPositionHeld Win Drawn Lost TChampionship TRunnerUp
0 T21 1998 Very New 646 2 41.18 26.63 32.20 0.00 5.88
1 T39 1994 Very New 160 10 32.50 28.12 39.38 0.00 0.00
2 T46 1990 Very New 114 9 22.81 38.60 38.60 0.00 0.00
3 T19 1960 New 988 3 33.70 25.91 40.38 0.00 0.00
4 T20 1951 New 1096 2 33.49 22.08 44.43 0.00 3.03
5 T11 1941 Old 1530 1 36.80 25.62 37.58 2.22 11.11
6 T9 1939 Old 1986 2 35.15 26.28 38.57 0.00 1.72

Step-7: Suggestions to the association management on the quality, quantity, variety, velocity, veracity, etc. of the data points collected, so that the association can perform better data analysis in future.¶

Please find below suggestions for data point collection and other relevant guidelines:

  1. Volume: We can add more teams for a better understanding of the data. To increase the predictive power of the dataset, we can add other information such as player details, demographics, and tournament location. Database systems can then move from traditional systems to more advanced big data systems.

  2. Velocity: Considering the time value of the data, it appears outdated and of limited use, particularly if the Big Data project is to serve any real-time or near-real-time business needs. In such a context we should redefine the data quality metrics so that they are both relevant and feasible in a real-time setting.

  3. Variety: For better insights and for AI/ML modeling projects, we can add other data types (structured, semi-structured, and unstructured) coming from different data sources relevant to basketball.

  4. Veracity: We have incomplete team information. For example, Team 61 has no information about Score, PlayedGames, etc., yet has HighestPositionHeld as 1. The accuracy of data collection should be improved. Besides inaccuracies, veracity also covers data consistency (the statistical reliability of the data) and data trustworthiness (based on data origin, collection and processing methods, security infrastructure, etc.). These data quality issues in turn impact data integrity and data accountability.

  5. Value: The Value characteristic connects directly to the end purpose and the business use cases. We can harness the power of Big Data for many diverse business pursuits, and those pursuits are the real drivers of how data quality is defined, measured, and improved. Data Science is already playing a pivotal role in sports analytics.

  6. Based on a strong understanding of the business use cases and the Big Data architecture, we can design and implement an optimal layer of data governance strategy to further improve the data quality with data definitions, metadata requirements, data ownership, data flow diagrams, etc.

  7. We can add more attributes to the dataset. More relevant attributes, such as Cancelled Games, Basket Ratio, Winning Ratio, and Win/Loss Percentage, will help us analyze teams more accurately.

  8. Even simple ML models can add value to the present EDA use case.

  9. Big data and data science together allow us to see both the forest and the trees (Micro and Macro perspectives).

  10. Visualization, dashboarding, and interactivity make the data more useful to the general public. We can build an API and deploy it on the cloud to serve this purpose.
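The derived attributes proposed in suggestion 7 can be computed directly from existing columns. A minimal sketch on a toy frame in the shape of the `basketball` dataset (values illustrative; `WinLossPct` here uses one possible definition, wins as a share of decided games):

```python
import pandas as pd

# Toy rows in the shape of the basketball dataset (illustrative values)
basketball = pd.DataFrame({
    "Team": ["T1", "T2"],
    "PlayedGames": [2762, 2762],
    "WonGames": [1647, 1581],
    "LostGames": [563, 608],
    "BasketScored": [5947, 5900],
    "BasketGiven": [3140, 3114],
})

# Suggested derived attributes
basketball["BasketRatio"] = basketball.BasketScored / basketball.BasketGiven
basketball["WinningRatio"] = basketball.WonGames / basketball.PlayedGames
# One possible definition: wins as a share of decided (non-drawn) games
basketball["WinLossPct"] = basketball.WonGames / (basketball.WonGames + basketball.LostGames)

print(basketball.round(3))
```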

References:¶

  1. Towards Data Science. Sports Analytics
  2. Kaggle. Kaggle Code
  3. KdNuggets
  4. AnalyticsVidhya
  5. Wikipedia. Basketball
  6. Wikipedia. Sports Analytics
  7. Wikipedia. National Basketball Association
  8. Zuccolotto, Manisera, & Sandri. "Basketball Data Science: With Applications in R." Chapman & Hall/CRC Data Science Series, 2020. Print.
  9. Baker, & Shea. "Basketball Analytics: Objective and Efficient Strategies for Understanding How Teams Win, 2013." Print.
  10. Shea. "Basketball Analytics: Spatial Tracking." Kindle.
  11. Oliver, & Alamar. "Sports Analytics: A Guide for Coaches, Managers, and Other Decision Makers, 2013." Print.
  12. Numpy
  13. Pandas
  14. SciPy
  15. MatplotLib
  16. Seaborn
  17. Python
  18. Plotly
  19. Bokeh
  20. RStudio
  21. MiniTab
  22. Anaconda

Part-C: Solution¶

DOMAIN: Startup ecosystem

CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company reports on technology news, analysis of emerging trends, and profiles of new tech businesses and products. Their event, Startup Battlefield, is the world’s pre-eminent startup competition, featuring 15-30 top early-stage startups pitching to top judges in front of a vast live audience, present in person and online.

DATA DESCRIPTION: CompanyX_EU.csv - Each row in the dataset is a Start-up company and the columns describe the company.

DATA DICTIONARY:

  1. Startup: Name of the company
  2. Product: Actual product
  3. Funding: Funds raised by the company in USD
  4. Event: The event the company participated in
  5. Result: One of Contestant, Finalist, Audience choice, Winner, or Runner up
  6. OperatingState: Current status of the company: Operating, Closed, Acquired, or IPO

*Dataset has been downloaded from the internet. All the credit for the dataset goes to the original creator of the data.

PROJECT OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.

1. Read the CSV file¶

In [104]:
compx=pd.read_csv("Compx.csv");
In [105]:
compx.head()
Out[105]:
Startup Product Funding Event Result OperatingState
0 2600Hz 2600hz.com NaN Disrupt SF 2013 Contestant Operating
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed
2 3DPrinterOS 3dprinteros.com NaN Disrupt SF 2016 Contestant Operating
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating
4 42 Technologies 42technologies.com NaN Disrupt NYC 2013 Contestant Operating
In [106]:
compx.tail()
Out[106]:
Startup Product Funding Event Result OperatingState
657 Zivity zivity.com $8M TC40 2007 Contestant Operating
658 Zmorph zmorph3d.com $1M - Audience choice Operating
659 Zocdoc zocdoc.com $223M TC40 2007 Contestant Operating
660 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating
661 Zumper zumper.com $31.5M Disrupt SF 2012 Finalist Operating
In [107]:
compx.shape
Out[107]:
(662, 6)

2. Data Exploration¶

2A. Check the datatypes of each attribute.¶

In [108]:
compx.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662 entries, 0 to 661
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Startup         662 non-null    object
 1   Product         656 non-null    object
 2   Funding         448 non-null    object
 3   Event           662 non-null    object
 4   Result          662 non-null    object
 5   OperatingState  662 non-null    object
dtypes: object(6)
memory usage: 31.2+ KB

2B. Check for null values in the attributes.¶

In [109]:
def missing_check(df):
    total = df.isnull().sum().sort_values(ascending=False)   # total number of null values
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)  # percentage of values that are null
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])  # putting the above two together
    return missing_data # return the dataframe
missing_check(compx)
Out[109]:
Total Percent
Funding 214 0.323263
Product 6 0.009063
Startup 0 0.000000
Event 0 0.000000
Result 0 0.000000
OperatingState 0 0.000000

3. Data preprocessing & visualisation:¶

3A. Drop the null values.¶

In [110]:
compx1 = compx.dropna(subset=['Funding', 'Product'])
In [111]:
compx1
Out[111]:
Startup Product Funding Event Result OperatingState
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating
5 5to1 5to1.com $19.3M TC50 2009 Contestant Acquired
6 8 Securities 8securities.com $29M Disrupt Beijing 2011 Finalist Operating
10 AdhereTech adheretech.com $1.8M Hardware Battlefield 2014 Contestant Operating
... ... ... ... ... ... ...
657 Zivity zivity.com $8M TC40 2007 Contestant Operating
658 Zmorph zmorph3d.com $1M - Audience choice Operating
659 Zocdoc zocdoc.com $223M TC40 2007 Contestant Operating
660 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating
661 Zumper zumper.com $31.5M Disrupt SF 2012 Finalist Operating

446 rows × 6 columns

In [112]:
compx1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 446 entries, 1 to 661
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Startup         446 non-null    object
 1   Product         446 non-null    object
 2   Funding         446 non-null    object
 3   Event           446 non-null    object
 4   Result          446 non-null    object
 5   OperatingState  446 non-null    object
dtypes: object(6)
memory usage: 24.4+ KB

3B. Convert the ‘Funding’ features to a numerical value.¶

In [113]:
compx1.loc[:,'Funds_in_million'] = compx1['Funding'].apply(lambda x: float(x[1:-1])/1000 if x[-1] == 'K' 
                                                           else (float(x[1:-1])*1000 if x[-1] == 'B' 
                                                           else float(x[1:-1])))
In [114]:
compx1
Out[114]:
Startup Product Funding Event Result OperatingState Funds_in_million
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed 0.63
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating 1.00
5 5to1 5to1.com $19.3M TC50 2009 Contestant Acquired 19.30
6 8 Securities 8securities.com $29M Disrupt Beijing 2011 Finalist Operating 29.00
10 AdhereTech adheretech.com $1.8M Hardware Battlefield 2014 Contestant Operating 1.80
... ... ... ... ... ... ... ...
657 Zivity zivity.com $8M TC40 2007 Contestant Operating 8.00
658 Zmorph zmorph3d.com $1M - Audience choice Operating 1.00
659 Zocdoc zocdoc.com $223M TC40 2007 Contestant Operating 223.00
660 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating 3.40
661 Zumper zumper.com $31.5M Disrupt SF 2012 Finalist Operating 31.50

446 rows × 7 columns
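The lambda above assumes every entry looks like `$<number><K|M|B>`. A slightly more defensive variant (a sketch; `parse_funding` is a helper name introduced here) returns `NaN` for anything that does not match, which can then be dropped or inspected:

```python
import re
import numpy as np

def parse_funding(x):
    """Convert strings like '$630K', '$3.4M', '$1.2B' to millions of USD."""
    if not isinstance(x, str):
        return np.nan
    m = re.fullmatch(r"\$([\d.]+)([KMB])", x.strip())
    if m is None:
        return np.nan  # e.g. '-', 'Undisclosed', or malformed entries
    value, unit = float(m.group(1)), m.group(2)
    scale = {"K": 1e-3, "M": 1.0, "B": 1e3}[unit]
    return value * scale

print(parse_funding("$630K"))  # 0.63
print(parse_funding("$3.4M"))  # 3.4
```

Applied with `compx1['Funding'].apply(parse_funding)`, this would behave like the lambda on well-formed entries but not raise on unexpected ones.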

3C. Plot box plot for funds in million.¶

In [115]:
plt.figure(figsize=(20,5))
ax = sns.boxplot(x=compx1["Funds_in_million"])
No description has been provided for this image
In [116]:
# Adding swarmplot to boxplot for better visualization

plt.figure(figsize=(20,8))
ax = sns.boxplot(x=compx1["Funds_in_million"], whis=np.inf)
ax = sns.swarmplot(x=compx1["Funds_in_million"], color=".2")

# We can also use jitter for better visualization
# ax = sns.stripplot(x=compx1["Funds_in_million"], color=".2", jitter=0.3)
No description has been provided for this image

3D. Check the number of outliers greater than the upper fence.¶

In [117]:
# Calculate the IQR
cols = ['Funds_in_million']
Q1 = compx1[cols].quantile(0.25)
Q3 = compx1[cols].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
Funds_in_million    8.72975
dtype: float64
In [118]:
Outliers = np.where(compx1[cols] > (Q3 + 1.5 * IQR))
Outliers
Out[118]:
(array([  3,  16,  21,  27,  33,  67,  71,  82,  84,  85,  87,  88, 102,
        111, 117, 121, 122, 127, 129, 140, 144, 146, 153, 157, 183, 187,
        189, 192, 205, 208, 212, 221, 230, 260, 263, 264, 267, 286, 291,
        294, 311, 318, 365, 371, 375, 389, 395, 400, 403, 409, 418, 432,
        433, 436, 437, 438, 439, 440, 443, 445], dtype=int64),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64))
In [119]:
# Rows without outliers
NonOutliers = compx1[cols][~((compx1[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
NonOutliers
Out[119]:
Funds_in_million
1 0.63
3 1.00
5 19.30
10 1.80
11 1.00
... ...
645 10.00
646 0.90
657 8.00
658 1.00
660 3.40

386 rows × 1 columns

In [120]:
NonOutliers.shape, compx1.shape
Out[120]:
((386, 1), (446, 7))
In [121]:
NumberOfOutliers = compx1.shape[0] - NonOutliers.shape[0]
NumberOfOutliers
Out[121]:
60
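Equivalently, the count of points above the upper fence can be taken in one step; a sketch on toy values standing in for `Funds_in_million`:

```python
import pandas as pd

# Toy series standing in for compx1['Funds_in_million'] (illustrative values)
funds = pd.Series([0.5, 1.0, 2.0, 4.0, 8.0, 100.0, 250.0])

Q1, Q3 = funds.quantile(0.25), funds.quantile(0.75)
upper_fence = Q3 + 1.5 * (Q3 - Q1)
# Boolean mask summed directly gives the outlier count
n_outliers = (funds > upper_fence).sum()
print(upper_fence, n_outliers)
```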

3E. Check frequency of the OperatingState features classes.¶

In [122]:
compx1["OperatingState"].value_counts()
Out[122]:
Operating    319
Acquired      66
Closed        57
Ipo            4
Name: OperatingState, dtype: int64
In [123]:
plt.figure(figsize=(10,5))
ax = sns.countplot(x="OperatingState", data=compx1)
No description has been provided for this image
In [124]:
compx1.shape[0]
Out[124]:
446

4. Statistical Analysis¶

4A. Is there any significant difference between Funds raised by companies that are still operating vs companies that closed down?¶

In [125]:
plt.figure(figsize=(10,5))
sns.barplot(x ='OperatingState', y='Funds_in_million' ,data = compx1)
plt.title("Operating State vs Funds in Million",size=15)
plt.show()
No description has been provided for this image
In [126]:
#Understand the distribution of 'Funds_in_million' variable
plt.figure(figsize=(10,5))
ax = sns.histplot(compx1['Funds_in_million'], log_scale=True, kde=True)
No description has been provided for this image
In [127]:
# Companies Operating vs Companies Closed

CompOperating = compx1[['Funds_in_million','OperatingState']][compx1['OperatingState']=='Operating']
CompClosed = compx1[['Funds_in_million','OperatingState']][compx1['OperatingState']=='Closed']
In [128]:
CompOperating.head()
Out[128]:
Funds_in_million OperatingState
3 1.00 Operating
6 29.00 Operating
10 1.80 Operating
13 1.00 Operating
14 0.75 Operating
In [129]:
CompOperating.describe()
Out[129]:
Funds_in_million
count 319.000000
mean 19.062186
std 104.677241
min 0.005000
25% 0.745500
50% 2.200000
75% 8.250000
max 1700.000000
In [130]:
plt.figure(figsize=(10,5))
ax = sns.histplot(CompOperating['Funds_in_million'], log_scale=True, kde=True)
No description has been provided for this image
In [131]:
CompClosed.head()
Out[131]:
Funds_in_million OperatingState
1 0.6300 Closed
11 1.0000 Closed
17 0.9292 Closed
20 1.9000 Closed
34 0.6000 Closed
In [132]:
CompClosed.describe()
Out[132]:
Funds_in_million
count 57.000000
mean 3.258170
std 5.923294
min 0.093000
25% 0.475000
50% 0.929200
75% 3.500000
max 35.500000
In [133]:
plt.figure(figsize=(10,5))
ax = sns.histplot(CompClosed['Funds_in_million'], log_scale=True, kde=True)
No description has been provided for this image

Important Insights:

  • From the graphical and descriptive statistics, it appears there is a significant difference between the funds raised by companies that are operating and those that are closed.

  • This will be validated using hypothesis testing.

4B. Write the null hypothesis and alternative hypothesis.¶

The two hypotheses for this particular two sample t-test are as follows:

  • µ1 = Mean funding for companies that are Operating
  • µ2 = Mean funding for companies that are Closed
  • Null Hypothesis, H0: µ1 = µ2 (the two population means are equal)
  • Alternate Hypothesis, HA: µ1 ≠ µ2 (the two population means are not equal)

4C. Test for significance and conclusion¶

Assumptions:

  • Observations in two groups have an approximately normal distribution (Shapiro-Wilks Test)

  • Homogeneity of variances (equal variances between treatment groups), checked with a Levene or Bartlett test; this assumption does not hold in our case.

  • The two groups are sampled independently from each other from the same population

Note: The two sample t-test is relatively robust to violations of the normality and homogeneity-of-variance assumptions when the sample size is large (n ≥ 30) and the groups are of equal size (n1 = n2).

If the sample size is small and the data do not follow a normal distribution, we should use the non-parametric Mann-Whitney U test (Wilcoxon rank-sum test).

Conduct a two sample t-test:

Next, we’ll use the ttest_ind() function from the scipy.stats library to conduct a two sample t-test, which uses the following syntax:

ttest_ind(a, b, equal_var=True)

where:

  • a: an array of sample observations for group 1
  • b: an array of sample observations for group 2
  • equal_var: if True, perform a standard independent 2 sample t-test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variances. This is True by default.

Before we perform the test, we need to decide if we’ll assume the two populations have equal variances or not. As a rule of thumb, we can assume the populations have equal variances if the ratio of the larger sample variance to the smaller sample variance is less than 4:1.
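Given the heavy right skew visible in the histograms, the Mann-Whitney U test mentioned above is a useful companion check alongside Welch's t-test. A sketch on synthetic log-normal samples (the sizes mirror the two groups, but the values are simulated, not the notebook's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic, right-skewed funding samples standing in for the two groups
operating = rng.lognormal(mean=1.0, sigma=1.5, size=319)
closed = rng.lognormal(mean=0.0, sigma=1.2, size=57)

# Mann-Whitney U / Wilcoxon rank-sum: no normality assumption required
stat, pval = stats.mannwhitneyu(operating, closed, alternative="two-sided")
print(round(stat, 1), round(pval, 4))
```

On the real `CompOperating` and `CompClosed` series, agreement between the two tests would strengthen the conclusion.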

In [134]:
# Find variance for each group
var1, var2 = np.var(CompOperating['Funds_in_million']), np.var(CompClosed['Funds_in_million'])
print(var1, var2)
10922.975787130521 34.469877948057864
In [135]:
VarianceRatio = var1/var2
VarianceRatio
Out[135]:
316.8846667687709
In [136]:
# The variance ratio is far above 4, so use Welch's t-test (equal_var=False)

stats.ttest_ind(CompOperating['Funds_in_million'], CompClosed['Funds_in_million'], equal_var=False)
Out[136]:
Ttest_indResult(statistic=2.6727214329249227, pvalue=0.007898898722704632)

Interpretation:

  1. Because the p-value of our test (0.00789) is less than alpha = 0.05, we reject the null hypothesis of the test.

  2. We do have sufficient evidence to say that there is a significant difference between the funds raised by the companies that are operating vs the companies that are closed.

4D. Make a copy of the original data frame.¶

In [137]:
# Make a copy of the original dataset
compxOriginal=compx.copy(deep=True);
In [138]:
compxOriginal.head()
Out[138]:
Startup Product Funding Event Result OperatingState
0 2600Hz 2600hz.com NaN Disrupt SF 2013 Contestant Operating
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed
2 3DPrinterOS 3dprinteros.com NaN Disrupt SF 2016 Contestant Operating
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016 Audience choice Operating
4 42 Technologies 42technologies.com NaN Disrupt NYC 2013 Contestant Operating
In [139]:
compxOriginal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 662 entries, 0 to 661
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Startup         662 non-null    object
 1   Product         656 non-null    object
 2   Funding         448 non-null    object
 3   Event           662 non-null    object
 4   Result          662 non-null    object
 5   OperatingState  662 non-null    object
dtypes: object(6)
memory usage: 31.2+ KB

4E. Check frequency distribution of Result variables.¶

In [140]:
compxOriginal["Result"].value_counts()
Out[140]:
Contestant         488
Finalist            84
Audience choice     41
Winner              26
Runner up           23
Name: Result, dtype: int64
In [141]:
plt.figure(figsize=(10,5))
ax = sns.countplot(x="Result", data=compxOriginal)
No description has been provided for this image

4F. Calculate percentage of winners that are still operating and percentage of contestants that are still operating¶

In [142]:
# Companies Operating vs Companies Closed

winnerOp = compxOriginal[['Result','OperatingState']][(compxOriginal['OperatingState']=='Operating') 
                                                    & (compxOriginal['Result']=='Winner')]
contestantOp = compxOriginal[['Result','OperatingState']][(compxOriginal['OperatingState']=='Operating') 
                                                    & (compxOriginal['Result']=='Contestant')]
In [143]:
winnerOp.head()
Out[143]:
Result OperatingState
13 Winner Operating
127 Winner Operating
166 Winner Operating
209 Winner Operating
244 Winner Operating
In [144]:
winnerOp.describe()
Out[144]:
Result OperatingState
count 19 19
unique 1 1
top Winner Operating
freq 19 19
In [145]:
contestantOp.head()
Out[145]:
Result OperatingState
0 Contestant Operating
2 Contestant Operating
4 Contestant Operating
7 Contestant Operating
8 Contestant Operating
In [146]:
contestantOp.describe()
Out[146]:
Result OperatingState
count 332 332
unique 1 1
top Contestant Operating
freq 332 332
In [147]:
# Operating winner and contestant

winnerContestantOp = 19 + 332
winnerContestantOp
Out[147]:
351
In [148]:
# Using the data from above analysis

winnerOpPercent, contestantOpPercent = 19/351, 332/351
print(winnerOpPercent, contestantOpPercent)
0.05413105413105413 0.9458689458689459

4G. Write your hypothesis comparing the proportion of companies that are operating between winners and contestants:¶

The two hypotheses for this particular two sample z-test are as follows:

  • P1 = Proportion of companies that are operating and winners
  • P2 = Proportion of companies that are operating and contestants
  • Null Hypothesis, H0: P1 = P2 (the two population proportions are equal)
  • Alternate Hypothesis, HA: P1 ≠ P2 (the two population proportions are not equal)

4H. Test for significance and conclusion¶

In [149]:
# Assuming independent samples, large sample sizes, and the hypothesized population proportion difference as zero.

from statsmodels.stats.proportion import proportions_ztest
count = np.array([19, 332])
nobs = np.array([351, 351])
stat, pval = proportions_ztest(count, nobs)
print('{0:0.3f}'.format(pval), '{0:0.3f}'.format(stat))
0.000 -23.627

Interpretation:

  1. Because the p-value of our test (0.000) is less than alpha = 0.05, we reject the null hypothesis of the test.

  2. We do have sufficient evidence to say that there is a significant difference between the proportion of operating companies in two classes like winners and contestants.
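Note that 19/351 and 332/351 are shares of the operating pool, so the test above largely reflects that contestants outnumber winners. An alternative framing compares the share of winners still operating (19 of 26) with the share of contestants still operating (332 of 488); a hand-rolled two-proportion z-test on those counts (a sketch, using only values reported in the outputs above):

```python
import numpy as np
from scipy import stats

# Counts taken from the value_counts / describe outputs above
oper = np.array([19, 332])    # operating winners, operating contestants
total = np.array([26, 488])   # all winners, all contestants

p1, p2 = oper / total
pooled = oper.sum() / total.sum()
se = np.sqrt(pooled * (1 - pooled) * (1 / total[0] + 1 / total[1]))
z = (p1 - p2) / se
pval = 2 * stats.norm.sf(abs(z))
print(round(z, 3), round(pval, 3))
```

Under this framing the z statistic is small (about 0.54) and the survival rates of winners and contestants are not significantly different; which framing answers 4G depends on how the proportions are defined.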

4I. Select only the Event that has ‘disrupt’ keyword from 2013 onwards.¶

In [150]:
# Using the str.contains function; note this matches only events containing '2013', not later years

CompDisrupt2013 = compxOriginal[compxOriginal['Event'].str.contains('Disrupt') & compxOriginal['Event'].str.contains('2013')]
CompDisrupt2013
Out[150]:
Startup Product Funding Event Result OperatingState
0 2600Hz 2600hz.com NaN Disrupt SF 2013 Contestant Operating
1 3DLT 3dlt.com $630K Disrupt NYC 2013 Contestant Closed
4 42 Technologies 42technologies.com NaN Disrupt NYC 2013 Contestant Operating
30 Ansa ansa.com NaN Disrupt SF 2013 Contestant Operating
33 AppArchitect apparchitect.com NaN Disrupt NYC 2013 Contestant Acquired
... ... ... ... ... ... ...
628 Wanderio wanderio.com $275K Disrupt EU 2013 Contestant Operating
637 What Now Travel whatnowtravel.com NaN Disrupt EU 2013 Contestant Operating
641 Workspot workspot.com $15.8M Disrupt NYC 2013 Contestant Operating
656 Zenefits zenefits.com $583.6M Disrupt NYC 2013 Finalist Operating
660 Zula zulaapp.com $3.4M Disrupt SF 2013 Audience choice Operating

77 rows × 6 columns
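The filter above matches only events whose name contains '2013'. To honour "2013 onwards" literally, the year can be extracted and compared numerically; a sketch on toy event strings in the dataset's format:

```python
import pandas as pd

# Toy events in the dataset's format (illustrative rows)
events = pd.DataFrame({"Event": ["Disrupt SF 2013", "Disrupt NY 2016",
                                 "TC40 2007", "Disrupt NYC 2012", "-"]})

# Extract a 4-digit year; rows without one become NaN and fail the comparison
year = events["Event"].str.extract(r"(\d{4})", expand=False).astype(float)
mask = events["Event"].str.contains("Disrupt") & (year >= 2013)
print(events[mask]["Event"].tolist())  # ['Disrupt SF 2013', 'Disrupt NY 2016']
```

Applied to `compxOriginal['Event']`, the same mask would also capture Disrupt events from 2014 onward.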